Exponential Moving Average (EMA) in Stable Diffusion

Published on 2024-07-26 14:51 · stable diffusion · EMA

1. Moving Average in Stable Diffusion (SMA & EMA)

1. Moving average
2. 移动平均值 (Moving average)
3. How We Trained Stable Diffusion for Less than $50k (Part 3)

Moving Average
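In statistics, a moving average is a calculation that analyzes data points by creating a series of averages of different selections from the full data set.

Given a sequence of numbers and a fixed subset size, the first element of the moving average is obtained by averaging the initial fixed-size subset of the sequence. The subset is then modified by "shifting forward": the first number of the subset is dropped and the next value in the sequence is included.

This explanation of the moving average comes from 移动平均值 (reference 2 above).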

1.1 Simple Moving Average (SMA, an unweighted MA)
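
The sliding-window averaging described in the Moving Average section can be sketched in a few lines of Python. This is only an illustration: the function name `simple_moving_average` and the sample sequence are assumptions, not part of the original article.

```python
from collections import deque


def simple_moving_average(values, window):
    """Return the unweighted average of every full sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    current = deque()      # values currently inside the window
    running_sum = 0.0
    for v in values:
        current.append(v)
        running_sum += v
        if len(current) > window:
            # "Shift forward": drop the oldest value from the window
            running_sum -= current.popleft()
        if len(current) == window:
            averages.append(running_sum / window)
    return averages


print(simple_moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

Every value inside the window gets the same weight `1 / window`, which is what makes this average "unweighted".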

1.2 Exponential Moving Average (EMA, a weighted MA)

In the context of Stable Diffusion, the Exponential Moving Average (EMA) is a technique used during the training of machine learning models, particularly neural networks, to stabilize and improve the model’s performance.

The Exponential Moving Average is a method of averaging that gives more weight to recent data points, making it more responsive to recent changes compared to a simple moving average, which treats all data points equally.
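
To make the difference in responsiveness concrete, here is a minimal sketch that compares an EMA with a plain average on a sequence that jumps at the end. The function name, the choice to seed the EMA with the first value, and `alpha=0.5` are assumptions chosen for illustration.

```python
def exponential_moving_average(values, alpha):
    """EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1}, seeded with the first value."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ema = []
    for x in values:
        if not ema:
            ema.append(float(x))  # assumption: start from the first observation
        else:
            ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema


series = [0, 0, 0, 0, 10]          # a sudden jump at the last step
ema = exponential_moving_average(series, alpha=0.5)
sma_last = sum(series) / len(series)  # unweighted average over the same 5 values

print(ema[-1], sma_last)  # 5.0 2.0
```

The EMA moves halfway to the new value in one step, while the unweighted average over the same five points moves only a fifth of the way.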

1.2.1 EMA in Stable Diffusion

In the context of Stable Diffusion, EMA is applied to the model parameters during training to create a smoothed version of the model. This is particularly useful in machine learning because the training process can be noisy, with the model parameters oscillating as they converge towards an optimal solution. By maintaining an EMA of the model parameters, the training process can benefit from the following:

Smoothing: EMA smooths out the parameter updates, reducing the impact of noise and making the training process more stable.
Better Generalization: The EMA version of the model often generalizes better on unseen data compared to the model with the raw parameters. This is because EMA tends to favor parameter values that are more consistent over time.
Preventing Overfitting: By averaging the parameters over time, EMA can help mitigate overfitting, especially in cases where the model might otherwise converge too quickly to a suboptimal solution.

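Author's personal understanding
The cost function (loss function) is a function of the parameters (weights and biases), so each loss value corresponds to one set of parameter values. When the loss oscillates, the model parameters are changing as well. When training SD, the MSE loss oscillates up and down during gradient descent, and the model parameters oscillate with it. EMA can be used to take a middle value of these oscillating parameters. This middle value better represents the average level of the parameters over all time steps, which gives the model better generalization.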
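
The intuition about oscillating parameters can be checked with a toy example. The sketch below is an assumption-laden illustration, not the author's experiment: a single hypothetical parameter bounces between 0.9 and 1.1, and an EMA with decay 0.995 (the value used in train.py later in this article) tracks it.

```python
def ema_track(values, beta, start):
    """Return the EMA after each step: ema <- beta * ema + (1 - beta) * x."""
    ema = start
    history = []
    for x in values:
        ema = beta * ema + (1 - beta) * x
        history.append(ema)
    return history


# Hypothetical parameter oscillating around 1.0
raw = [0.9 if i % 2 == 0 else 1.1 for i in range(200)]
smoothed = ema_track(raw, beta=0.995, start=1.0)

raw_spread = max(raw) - min(raw)
ema_spread = max(smoothed[-50:]) - min(smoothed[-50:])
print(f"raw spread: {raw_spread:.4f}, EMA spread: {ema_spread:.6f}")
```

The raw value swings by 0.2, while the EMA stays within a band well under 0.001 around 1.0, which is the "middle value" the paragraph above refers to.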

Stable Diffusion 2 uses Exponential Moving Averaging (EMA), which maintains an exponential moving average of the weights. At every time step, the EMA model is updated by taking 0.9999 times the current EMA model plus 0.0001 times the new weights after the latest forward and backward pass. By default, the EMA algorithm is applied after every gradient update for the entire training period. However, this can be slow due to the memory operations required to read and write all the weights at every step.
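Applying EMA to all parameters at every time step is expensive, because all of the model's parameters have to be read and written at each step.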
$\text{EMA}_t=0.0001\cdot x_t+0.9999\cdot \text{EMA}_{t-1}$
To reduce the cost of computing the EMA, it is applied only during the final stretch of training.
To avoid this costly procedure, we start with a key observation: since the old weights are decayed by a factor of 0.9999 at every batch, the early iterations of training only contribute minimally to the final average. This means we only need to take the exponential moving average of the final few steps. Concretely, we train for 1,400,000 batches and only apply EMA for the final 50,000 steps, which is about 3.5% of the training period. The weights from the first 1,350,000 iterations decay away by (0.9999)^50000, so their aggregate contribution would have a weight of less than 1% in the final model. Using this technique, we can avoid adding overhead for 96.5% of training and still achieve a nearly equivalent EMA model.
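
The arithmetic behind the "less than 1%" claim can be reproduced directly. The sketch below only evaluates the numbers quoted above; the helper name `residual_weight` is an assumption for illustration.

```python
DECAY = 0.9999
TOTAL_STEPS = 1_400_000
EMA_STEPS = 50_000


def residual_weight(decay, steps):
    """Total weight left on everything that preceded the last `steps` EMA updates."""
    return decay ** steps


w = residual_weight(DECAY, EMA_STEPS)
print(f"EMA window share of training: {EMA_STEPS / TOTAL_STEPS:.2%}")
print(f"residual weight of earlier iterations: {w:.4%}")
```

The residual weight comes out at roughly 0.67%, so everything before the final 50,000 steps contributes well under 1% of the final average, consistent with the claim above.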

1.2.2 Implementation in Stable Diffusion

During the training of a diffusion model, the EMA of the model’s weights is updated alongside the regular updates. Here’s a typical process:

Initialize EMA Weights: At the start of training, initialize the EMA weights to be the same as the model’s initial weights.
Update During Training: After each batch update, update the EMA weights using the formula mentioned above. This requires storing a separate set of weights for the EMA.
Use for Inference: At the end of the training, use the EMA weights for inference instead of the raw model weights. This is because the EMA weights represent a more stable and potentially better-performing version of the model.
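
The three steps above can be sketched without any deep learning framework. In this simplified illustration the "model" is a hypothetical dict of scalar parameters and the optimizer step is faked; `ema_update` and `beta=0.5` are assumptions chosen so the numbers are easy to follow.

```python
import copy


def ema_update(ema_weights, weights, beta):
    """In place: ema <- beta * ema + (1 - beta) * weights, per parameter name."""
    for name, value in weights.items():
        ema_weights[name] = beta * ema_weights[name] + (1 - beta) * value


# Hypothetical "model": a dict of scalar parameters
weights = {"w": 0.0, "b": 1.0}

# 1. Initialize the EMA weights as a separate copy of the initial weights
ema_weights = copy.deepcopy(weights)

# 2. After each batch update, refresh the EMA copy
for step in range(3):
    weights["w"] += 1.0  # stand-in for an optimizer step
    ema_update(ema_weights, weights, beta=0.5)

# 3. At inference time, read ema_weights instead of the raw weights
print(weights["w"], ema_weights["w"])  # 3.0 2.125
```

The EMA copy lags behind the raw weights, which is the price paid for smoothing; with a decay close to 1, as in real training, the lag is longer and the smoothing stronger.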

1.2.3 Practical Considerations

Choosing $\alpha$: The smoothing factor $\alpha$ is the weight given to the newest value ($\alpha=0.0001$ in the formula above, so the decay is $1-\alpha=0.9999$). It is a hyperparameter that needs to be chosen carefully. A common convention is to set $\alpha=\frac{2}{N+1}$, where $N$ is the number of iterations the average should effectively span.
Performance Overhead: Maintaining EMA weights requires additional memory and computational overhead, but the benefits in terms of model stability and performance often outweigh these costs.
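
As a quick worked example of the $\alpha=\frac{2}{N+1}$ convention, the sketch below converts between $\alpha$ and $N$ for the two decay values that appear in this article. The helper names are assumptions for illustration.

```python
def alpha_from_span(n):
    """Smoothing factor for an effective span of n iterations: 2 / (n + 1)."""
    return 2.0 / (n + 1)


def span_from_alpha(alpha):
    """Inverse of alpha_from_span: n = 2 / alpha - 1."""
    return 2.0 / alpha - 1.0


print(alpha_from_span(19))                 # 0.1
print(f"{span_from_alpha(0.005):.0f}")     # 399   (decay 0.995, as in train.py)
print(f"{span_from_alpha(0.0001):.0f}")    # 19999 (decay 0.9999, as in the formula above)
```

Under this convention, a decay of 0.9999 corresponds to averaging over roughly twenty thousand iterations, while 0.995 spans only a few hundred.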

module.py

class EMA:
    """Maintains an exponential moving average (EMA) of a model's parameters."""

    def __init__(self, beta):
        self.beta = beta  # Decay rate: weight kept on the previous average
        self.step = 0  # Step counter to keep track of the number of step_ema calls

    def update_model_average(self, ma_model, current_model):
        """Update the parameters of the EMA model (ma_model) from the current model."""
        for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
            old_weight, up_weight = ma_params.data, current_params.data
            # Blend the old average with the current weights
            ma_params.data = self.update_average(old_weight, up_weight)

    def update_average(self, old, new):
        """Return beta * old + (1 - beta) * new, or new if there is no old value."""
        if old is None:
            return new
        return old * self.beta + (1 - self.beta) * new

    def step_ema(self, ema_model, model, step_start_ema=2000):
        """Copy the model's parameters into ema_model while the step count is below
        step_start_ema; afterwards apply the EMA update. The step counter is
        incremented on every call."""
        if self.step < step_start_ema:
            self.reset_parameters(ema_model, model)
        else:
            self.update_model_average(ema_model, model)
        self.step += 1  # Increment the step counter

    def reset_parameters(self, ema_model, model):
        """Set the EMA model's parameters to the current model's parameters."""
        ema_model.load_state_dict(model.state_dict())

train.py

import copy
import logging
import os

import torch
from torch import nn, optim
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm

from module import EMA  # EMA class from module.py

# UNET, train_loader, ddpm_sampler, timesteps_to_time_emb and diffusion are
# assumed to be defined elsewhere in the project.


def train(args):
    device = args.device  # Get the device to run the training on
    model = UNET().to(device)   # Initialize the model and move it to the device
    model.train()
    optimizer = optim.AdamW(model.parameters(), lr=args.lr)  # set up the optimizer with AdamW
    mse = nn.MSELoss()  # Mean Squared Error loss function
    logger = SummaryWriter(os.path.join("runs", args.run_name))
    len_train = len(train_loader)
    # EMA: Exponential Moving Average with decay rate 0.995
    ema = EMA(0.995)
    # At the start of training, initialize the EMA weights to be the same as the model’s initial weights:
    # create a copy of the model for EMA, set to eval mode and no gradients
    ema_model = copy.deepcopy(model).eval().requires_grad_(False)
    print('Start into the loop !')
    for epoch in range(args.epochs):
        logging.info(f"Starting epoch {epoch}:")  # log the start of the epoch
        progress_bar = tqdm(train_loader)  # progress bar for the dataloader
        optimizer.zero_grad()  # Explicitly zero the gradient buffers
        accumulation_steps = 4
        # Load all data into a batch
        for batch_idx, (images, captions) in enumerate(progress_bar):
            images = images.to(device)  # move images to the device
            # The dataloader adds a batch dimension to the tensor, but a batch dimension was already added
            # for the VAE and CLIP inputs, so remove the extra one and keep only the dataloader's batch dimension
            images = torch.squeeze(images, dim=1)
            captions = captions.to(device)  # move caption to the device
            text_embeddings = torch.squeeze(captions, dim=1) # squeeze batch_size
            timesteps = ddpm_sampler.sample_timesteps(images.shape[0]).to(device)  # Sample random timesteps
            noisy_latent_images, noises = ddpm_sampler.add_noise(images, timesteps)  # Add noise to the images
            time_embeddings = timesteps_to_time_emb(timesteps)
            # x_t (batch_size, channel, Height/8, Width/8) (bs,4,256/8,256/8)
            # caption (batch_size, seq_len, dim) (bs, 77, 768)
            # t (batch_size, channel) (batch_size, 1280)
            # (bs,320,H/8,W/8)
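            # Run the UNet with gradients enabled; under torch.no_grad() the loss could not
            # backpropagate into the model's parameters and the optimizer step would not train it
            last_decoder_noise = model(noisy_latent_images, text_embeddings, time_embeddings)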
            # (bs,4,H/8,W/8)
            final_output = diffusion.final.to(device)
            predicted_noise = final_output(last_decoder_noise).to(device)
            loss = mse(noises, predicted_noise)  # Compute the loss
            loss.backward()  # Backpropagate the loss
            if (batch_idx + 1) % accumulation_steps == 0:  # Wait for several backward passes
                optimizer.step()  # Now we can do an optimizer step
                optimizer.zero_grad()  # Reset gradients to zero
                # EMA: update the averaged weights after each optimizer step
                ema.step_ema(ema_model, model)
            progress_bar.set_postfix(MSE=loss.item())  # Update the progress bar with the loss
            # log the loss to TensorBoard
            logger.add_scalar("MSE", loss.item(), global_step=epoch * len_train + batch_idx)
        # Save the model checkpoint
        os.makedirs(os.path.join("models", args.run_name), exist_ok=True)
        torch.save(model.state_dict(), os.path.join("models", args.run_name, "stable_diffusion.ckpt"))
        # Note: only the raw weights are saved here; ema_model.state_dict() must also be saved
        # if the EMA weights are to be used for inference
        torch.save(optimizer.state_dict(),
                   os.path.join("models", args.run_name, "optim.pt"))  # Save the optimizer state

Original article: https://blog.csdn.net/weixin_48524215/article/details/140687147
