LLM Training: Learning rate warmup, cosine decay and gradient clipping
1. Learning rate warmup
When training complex models, a learning-rate warmup helps keep training stable. During warmup, the learning rate is increased gradually from a very low value initial_lr up to a user-defined maximum learning rate peak_lr.
n_epochs = 15
initial_lr = 0.0001
peak_lr = 0.01

# train_loader is the training DataLoader (defined elsewhere); one step per batch
total_steps = len(train_loader) * n_epochs
warmup_steps = int(0.2 * total_steps)  # 20% of all steps are used for warmup
print(warmup_steps)
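As an illustration, here is a small numeric sketch of the warmup phase. The batch count of 100 per epoch is a made-up assumption (the article does not state the size of train_loader); it only serves to show how the total step count, the warmup step count, and the per-step increment relate:

# Hypothetical sanity check: assume the DataLoader yields 100 batches per epoch
batches_per_epoch = 100
example_total_steps = batches_per_epoch * n_epochs       # 1500 steps overall
example_warmup_steps = int(0.2 * example_total_steps)    # 300 warmup steps

# Linear warmup: the learning rate grows by a fixed increment each step
increment = (peak_lr - initial_lr) / example_warmup_steps
print(initial_lr)                                            # learning rate at step 0
print(initial_lr + (example_warmup_steps - 1) * increment)   # just below peak_lr at the last warmup step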
2. Cosine annealing
After the peak learning rate is reached, the learning rate is gradually lowered to min_lr following a cosine schedule: the cosine term starts at cos(0) = 1 and ends at cos(π) = -1, so as the step count grows the learning rate decays smoothly from peak_lr down to min_lr.
import math

min_lr = 0.1 * initial_lr      # floor the schedule decays to after warmup
track_lrs = []                 # record the learning rate used at every step

# Per-step increment for the linear warmup phase
lr_increment = (peak_lr - initial_lr) / warmup_steps

global_step = -1

# train_loader and optimizer are assumed to be defined elsewhere
for epoch in range(n_epochs):
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        # Adjust the learning rate based on the current phase (warmup or cosine annealing)
        if global_step < warmup_steps:
            # Linear warmup
            lr = initial_lr + global_step * lr_increment
        else:
            # Cosine annealing after warmup
            progress = ((global_step - warmup_steps) /
                        (total_steps - warmup_steps))
            lr = min_lr + (peak_lr - min_lr) * 0.5 * (
                1 + math.cos(math.pi * progress))

        # Apply the calculated learning rate to the optimizer
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        track_lrs.append(optimizer.param_groups[0]["lr"])

        # Calculate loss and update weights
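As a quick visual check of the schedule, here is a short sketch that plots the values collected in track_lrs (it assumes matplotlib is installed; the article itself does not use it). The curve should rise linearly over the first 20% of steps and then follow a cosine decay down toward min_lr:

import matplotlib.pyplot as plt  # assumption: matplotlib is available

plt.plot(range(len(track_lrs)), track_lrs)  # one learning-rate value per training step
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.title("Linear warmup followed by cosine decay")
plt.show()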
3. Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
clip_grad_norm_ rescales gradients based on their L2 norm: it computes the total L2 norm over all parameter gradients, and if that norm exceeds max_norm, every gradient is divided by the factor total_norm / max_norm, so the clipped norm is at most max_norm.
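Below is a minimal sketch of where the call sits inside the inner training loop from section 2. The loss computation is a hypothetical placeholder (compute_loss is not defined in the article); the point is that clipping happens after backward() has filled the gradients and before optimizer.step() applies them:

import torch

# ... inside the "for input_batch, target_batch in train_loader" loop, after setting the learning rate:
loss = compute_loss(model(input_batch), target_batch)  # hypothetical loss helper
loss.backward()                                         # populate .grad on every parameter

# If the total L2 norm of all gradients exceeds max_norm, each gradient is divided
# by (total_norm / max_norm); gradients below the threshold are left unchanged.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()                                        # update the weights using the clipped gradients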
Original article: https://blog.csdn.net/huoshanshaohui/article/details/142550058