李宏毅机器学习2023-HW13-Network Compression

🕗 发布于 2024-09-19 09:59 机器学习 人工智能 深度学习

文章目录

Task
Link
Baseline

Task

通过network compression完成图片分类，数据集跟hw3中11种食品分类一致。需要设计小模型student model，参数数目少于60k，训练该模型接近teacher model的精度test-Acc ≅ 0.902

Link

kaggle

Baseline

Simple Baseline

Just run the sample code

Medium Baseline

loss function定义为 KL divergence，公式如下：
$Loss=αT^2×KL(p||q)+(1−α)(Original Cross Entropy Loss),where \ p=softmax(\frac{\text{student's logits}}{T}),and\ q=softmax(\frac{\text{teacher's logits}}{T})$
同时epoch可以增加到50，其他地方标有#medium的也要修改

# medium
def loss_fn_kd(student_logits, labels, teacher_logits, alpha=0.5, temperature=5.0):
    # ------------TODO-------------
    # Refer to the above formula and finish the loss function for knowkedge distillation using KL divergence loss and CE loss.
    # If you have no idea, please take a look at the provided useful link above.
    student_prob = F.softmax(student_logits/temperature, dim=-1)
    teacher_prob = F.softmax(teacher_logits/temperature, dim=-1)
    KL_loss = (teacher_prob * (teacher_prob.log() - student_prob)).mean()
    CE_loss = nn.CrossEntropyLoss()(student_logits, labels)
    loss = alpha * temperature**2* KL_loss + (1 - alpha) * CE_loss
    return loss

Strong Baseline

用depth-wise and point-wise convolutions修改model architecture+增加epoch

other useful techniques
- group convolution (Actually, depthwise convolution is a specific type of group convolution)
- SqueezeNet
- MobileNet
- ShuffleNet
- Xception
- GhostNet

还可以应用中间层特征学习，这里我没有做

# strong
def dwpw_conv(in_channels, out_channels, kernel_size, stride=1, padding=0):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride, padding=padding, groups=in_channels), #depthwise convolution
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, 1), # pointwise convolution
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )

Boss Baseline

Other advanced Knowledge Distillation（FitNet/RKD/DM) + 增加epoch + Depthwise & Pointwise Conv layer（深度可分离卷积）

当然也可以应用中间层特征学习

FitNet Knowledge Distillation

FitNet focuses on transferring knowledge from intermediate feature representations (hidden layers) instead of just using the output logits. The student model is trained to mimic the feature maps from certain layers of the teacher model.

#boss
def loss_fn_fitnet(teacher_feature, student_feature, labels, alpha=0.5):
    """
    FitNet Knowledge Distillation Loss Function.
    
    Args:
    - teacher_feature: The feature maps from a hidden layer of the teacher model.
    - student_feature: The feature maps from the corresponding hidden layer of the student model.
    - labels: Ground truth labels for the task.
    - alpha: Weighting factor for the feature distillation loss.
    
    Returns:
    - loss: Combined loss with cross-entropy and feature map alignment.
    """
    # Mean squared error loss to align feature maps of teacher and student
    feature_loss = F.mse_loss(student_feature, teacher_feature)

    # Hard label cross-entropy loss for the student output (classification)
    hard_loss = F.cross_entropy(student_feature, labels)
    
    # Combine both losses
    loss = alpha * hard_loss + (1 - alpha) * feature_loss
    return loss

Relational Knowledge Distillation (RKD)

Relational Knowledge Distillation focuses on transferring the relationships (distances and angles) between data samples as learned by the teacher. The student model is trained to match these relationships instead of just focusing on output probabilities.

# boss
def pairwise_distance(x):
    """Calculate pairwise distance between batch samples."""
    return torch.cdist(x, x, p=2)

def angle_between_pairs(x):
    """Calculate angles between all pairs of points in batch."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)
    norm = diff.norm(dim=-1, p=2, keepdim=True)
    normalized_diff = diff / (norm + 1e-8)
    angles = torch.bmm(normalized_diff, normalized_diff.transpose(1, 2))
    return angles

def loss_fn_rkd(teacher_feature, student_feature, labels, alpha=0.5):
    """
    Relational Knowledge Distillation Loss Function.
    
    Args:
    - teacher_feature: Teacher model feature embeddings.
    - student_feature: Student model feature embeddings.
    - labels: Ground truth labels.
    - alpha: Weighting factor for relational distillation loss.
    
    Returns:
    - loss: Combined relational knowledge and hard label loss.
    """
    
    # Pairwise distances between features in the teacher and student model
    teacher_dist = pairwise_distance(teacher_feature)
    student_dist = pairwise_distance(student_feature)
    
    # Distillation loss using the L2 norm between relational distances
    distance_loss = F.mse_loss(student_dist, teacher_dist)

    # Angle-based loss between teacher and student feature vectors
    teacher_angle = angle_between_pairs(teacher_feature)
    student_angle = angle_between_pairs(student_feature)
    angle_loss = F.mse_loss(student_angle, teacher_angle)
    
    # Hard label cross-entropy loss for the student output
    hard_loss = F.cross_entropy(student_feature, labels)
    
    # Combine the losses
    loss = alpha * hard_loss + (1 - alpha) * (distance_loss + angle_loss)
    return loss

Distance Metric (DM) Knowledge Distillation

Distance Metric distillation focuses on transferring the distance metric (such as Euclidean distance or cosine similarity) between instances in the teacher’s feature space to the student model.

def loss_fn_dm(teacher_feature, student_feature, labels, alpha=0.5):
    """
    Distance Metric (DM) Knowledge Distillation Loss Function.
    
    Args:
    - teacher_feature: The feature representations from the teacher model.
    - student_feature: The feature representations from the student model.
    - labels: Ground truth labels for the task.
    - alpha: Weighting factor for distance metric loss.
    
    Returns:
    - loss: Combined distance metric loss and cross-entropy loss.
    """
    # Calculate pairwise distance between teacher and student embeddings
    teacher_dist = pairwise_distance(teacher_feature)
    student_dist = pairwise_distance(student_feature)
    
    # Distance metric loss using Mean Squared Error (MSE) loss
    dist_loss = F.mse_loss(student_dist, teacher_dist)
    
    # Hard label cross-entropy loss for the student's output
    hard_loss = F.cross_entropy(student_feature, labels)
    
    # Combine the losses
    loss = alpha * hard_loss + (1 - alpha) * dist_loss
    return loss

原文地址：https://blog.csdn.net/qq_42875127/article/details/142337542

免责声明：本站文章内容转载自网络资源，如本站内容侵犯了原著者的合法权益，可联系本站删除。更多内容请关注自学内容网（zxcms.com）！

上一篇：等保测评后：企业如何持续保护信息安全
下一篇：Microsoft Office LTSC 2024 离线安装ISO镜像

Java学习，基本数据类型
System.out.println("最小值：Double.MIN_VALUE=" + Double.MIN_VALUE);System.out.println("最小
阅读更多2024-11-17
创建第一个react项目
通过以上步骤，你已经成功创建并运行了你的第一个React项目。接下来，你可以继续探索React的更多功能，编写更复杂的组件和应用程序。希望这个教程对你有所帮助！如果有任何问题，欢迎随时提问。参考资料R
阅读更多2024-11-17
从零开始的c++之旅——二叉搜索树
这与之前实现的二叉树类似，只不过用上了模板跟构造函数，因为构造函数我们在后面需要用来生成节点。K _key;:_key(key){}//这里也能体现封装思想，不管我们如何实现的类此处我们只需定义成No
阅读更多2024-11-17
c/c++内存管理
int main()// new/delete 和 malloc/free最大区别是 new/delete对于【自定义类型】除了开空间还会调用构造函数和析构函数free(p1);delete p2;/
阅读更多2024-11-17
1、PyTorch介绍与张量的创建
【代码】1、PyTorch介绍与张量的创建。
阅读更多2024-11-17
‌REST风格（Representational State Transfer）
REST风格的核心思想是将Web应用程序的功能作为资源来表示，使用统一的标识符（URI）来对这些资源进行操作，并通过HTTP协议（如GET、POST、PUT、DELETE等）来定义对这些资源的操作。‌
阅读更多2024-11-17
软件测试 —— 自动化基础
自动化是指自动的代替人的行为完成操作，自动化在生活中可以说是随处可见，如：自动洒水机、自动洗手液等，这些生活中的自动案例有效的减少了我们人力的消耗，同时也提高了我们的生活质量，在我们软件中的自动化测试
阅读更多2024-11-17
Python爬虫下载新闻，Flask展现新闻（2）
Python爬虫下载新闻和Flask展现新闻的主要技术
阅读更多2024-11-17
【CSS in Depth 2 精译_057】第九章 CSS 的模块化与作用域 + 9.1 CSS 模块的定义（上）
本篇为《CSS in Depth》全新第2版9.1小节内容的上篇，主要介绍了 CSS 模块化的产生背景及相关概念，并结合上一节层叠图层（cascade layer）的知识，通过一个简单的 messag
阅读更多2024-11-17
分布式事务seata基于docker安装和项目集成seata
分布式系统节点通过网络连接，一定会出现分区问题（P）当分区出现时,系统的一致性和可用性就无法同时满足cp-->不同节点的角色不同ap-->不同节点的角色相同。
阅读更多2024-11-17