PyTorch GPU memory management: intermediate activation storage in the forward pass

Published 2024-11-26 16:20. Tags: pytorch, deep learning, artificial intelligence

I. Activation storage

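# PyTorch GPU memory management, intermediate activation storage in the forward pass, and torch.utils.checkpoint
The following example shows how activations stay resident in GPU memory after the forward pass.
Test code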

import torch
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.net1 = torch.nn.Linear(3, 300)
        self.net2 = torch.nn.Linear(300, 300)
        self.net3 = torch.nn.Linear(300, 400)
        self.net4 = torch.nn.Linear(400, 300)
        self.net5 = torch.nn.Linear(300, 100)
        # Running totals of the activation elements and bytes produced in forward()
        self.activation_sum = 0
        self.activation_size = 0

    def forward(self, x):
        x = self.net1(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net2(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net3(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net4(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        x = self.net5(x)
        self.activation_sum += x.nelement()
        self.activation_size += (x.nelement() * x.element_size())
        return x


def modelSize(model):
    """Return the total size in bytes of the model's parameters and buffers."""
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    return param_size + buffer_size

device = torch.device("cuda:0")

inputs = torch.randn(10, 3).to(device)
label = torch.randn(10, 100).to(device)

torch.cuda.empty_cache()
before = torch.cuda.memory_allocated()
model = MyModel().to(device)
after = torch.cuda.memory_allocated()
print(f"Memory allocated before building the model: {before}")
print(f"Memory allocated after building the model: {after}")
print(f"Increase from building the model: {after - before}")

print(f"Model size: {modelSize(model)}")

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model.train()
optimizer.zero_grad()

before = torch.cuda.memory_allocated()
print(f"Memory allocated before the forward pass: {before}")

output = model(inputs)  # forward pass

after = torch.cuda.memory_allocated()
print(f"Memory allocated after the forward pass: {after}, difference (intermediate activations): {after - before}")

loss = loss_fn(output, label)
loss.backward()
optimizer.step()

Output (all values in bytes):

Memory allocated before building the model: 4608
Memory allocated after building the model: 1457152
Increase from building the model: 1452544
Model size: 1449200
Memory allocated before the forward pass: 1457152
Memory allocated after the forward pass: 1514496, difference (intermediate activations): 57344
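
The figures above can be reproduced with plain arithmetic, which also explains why the measured increase (1452544) is slightly larger than the model size (1449200). The sketch below assumes that the caching allocator rounds every allocation up to a multiple of 512 bytes and that all tensors are float32; the helper name `rounded` and the constants are illustrative and not part of the original test code.

```python
BLOCK = 512          # assumed allocation granularity of the caching allocator, in bytes
FLOAT32_BYTES = 4


def rounded(nbytes, block=BLOCK):
    """Round a request up to the next multiple of `block` bytes."""
    return (nbytes + block - 1) // block * block


batch = 10
# (in_features, out_features) of the five Linear layers in MyModel
layers = [(3, 300), (300, 300), (300, 400), (400, 300), (300, 100)]

# The input (10 x 3) and the label (10 x 100) exist before the model is built.
before_model = rounded(batch * 3 * FLOAT32_BYTES) + rounded(batch * 100 * FLOAT32_BYTES)

# Each Linear layer owns a weight (out x in) and a bias (out).
exact_params = sum((i * o + o) * FLOAT32_BYTES for i, o in layers)
allocated_params = sum(
    rounded(i * o * FLOAT32_BYTES) + rounded(o * FLOAT32_BYTES) for i, o in layers
)

# Each layer's output (batch x out) is still alive after the forward pass.
exact_activations = sum(batch * o * FLOAT32_BYTES for _, o in layers)
allocated_activations = sum(rounded(batch * o * FLOAT32_BYTES) for _, o in layers)

print(before_model)           # 4608
print(exact_params)           # 1449200
print(allocated_params)       # 1452544
print(exact_activations)      # 56000
print(allocated_activations)  # 57344
```

Under these assumptions the 57344 bytes are the five layer outputs: the first four are kept because the next Linear layer needs its input for the backward pass, and the last one is still referenced by `output`.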

II. Model storage

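For the difference between model.parameters() and model.buffers(), which the model-size calculation above iterates over, see:
# Parameters and buffers in PyTorch models
# Parameters and buffers in PyTorch
In short, model.parameters() holds the tensors that receive gradients in the backward pass and are updated by the optimizer, while model.buffers() holds state that belongs to the model (it is saved in the state_dict and moves with .to(device)) but is not updated by the optimizer.
For example, in a Transformer the word embedding is trained, so it is a parameter; a fixed (for example sinusoidal) positional encoding is not trained, so it can be registered as a buffer. A learned position embedding would be a parameter instead.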
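
The distinction can be seen in a minimal sketch. `TinyEmbedding`, its sizes, and the random positional table are hypothetical and chosen only for illustration; `register_buffer`, `named_parameters`, and `named_buffers` are standard `torch.nn.Module` methods. The example runs on the CPU.

```python
import torch


class TinyEmbedding(torch.nn.Module):
    """Hypothetical module: trainable word embedding plus a fixed positional table."""

    def __init__(self, vocab_size=50, max_len=16, dim=8):
        super().__init__()
        # Trainable: nn.Embedding registers its weight as a parameter.
        self.word = torch.nn.Embedding(vocab_size, dim)
        # Not trainable: registered as a buffer (random values stand in for a real encoding).
        self.register_buffer("pos", torch.randn(max_len, dim))

    def forward(self, token_ids):
        # token_ids: integer tensor of shape (batch, seq_len)
        seq_len = token_ids.shape[1]
        return self.word(token_ids) + self.pos[:seq_len]


model = TinyEmbedding()
param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]

print(param_names)                  # ['word.weight']
print(buffer_names)                 # ['pos']
print(model.pos.requires_grad)      # False
print("pos" in model.state_dict())  # True
```

An optimizer built from `model.parameters()` would update only `word.weight`; `pos` is still saved with the model and moved by `.to(device)`.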

III. Memory allocation

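memory_allocated(): the number of bytes currently occupied by live tensors.
memory_reserved() (formerly memory_cached(), which is deprecated): the number of bytes the caching allocator has obtained from the GPU, that is, the allocated bytes plus freed blocks that are kept in the cache for reuse. The cached part is memory_reserved() minus memory_allocated().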

import torch

# Use the GPU as the device
device = torch.device('cuda')

# Step 1: allocate memory
tensor1 = torch.randn(1024, 1024, device=device)  # 1024 * 1024 float32 values = 4 MB
tensor2 = torch.randn(2048, 2048, device=device)  # 2048 * 2048 float32 values = 16 MB

# Print allocated and reserved memory
print("Before empty_cache:")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

# Step 2: release tensor1
del tensor1

# Print allocated and reserved memory (the freed block stays in the cache)
print("\nAfter deleting tensor1 (without empty_cache):")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

# Delete tensor2 so that no variable still holds GPU memory
del tensor2

# Step 3: release the cache
torch.cuda.empty_cache()

# Print allocated and reserved memory (reserved memory should now drop)
print("\nAfter calling empty_cache:")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

Result

Before empty_cache:
Allocated Memory: 20971520 bytes
Cached Memory: 20971520 bytes

After deleting tensor1 (without empty_cache):
Allocated Memory: 16777216 bytes
Cached Memory: 20971520 bytes

After calling empty_cache:
Allocated Memory: 0 bytes
Cached Memory: 0 bytes

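Memory is not released explicitly: deleting a tensor returns its block to PyTorch's caching allocator, not to the GPU. Unless torch.cuda.empty_cache() is called, the allocator keeps that block reserved, and the unused cached memory is not reclaimed automatically.
PyTorch's memory management: by default PyTorch does not hand unused memory back to the GPU but marks it as cached for later allocations. This avoids the cost of frequent device allocations and frees, so memory_reserved() can stay unchanged even after the tensors that used the memory are gone.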

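Running out of GPU memory ("CUDA out of memory") happens when a new allocation can be served neither from the allocator's cached free blocks nor by requesting more memory from the GPU. In practice this means that the live tensors counted by memory_allocated() plus the new request no longer fit in the device's capacity, and the program raises an error or crashes.
A large memory_reserved() value does not by itself cause an out-of-memory error: the cached part is free from PyTorch's point of view and can be reused for new tensors or released back to the GPU.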
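
The relationship between the two counters can be sketched with a small helper. `memory_report` and its dictionary keys are hypothetical names introduced here; `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` are the real PyTorch calls. On a machine without a GPU the helper simply reports zeros, which is an assumption made to keep the sketch runnable everywhere.

```python
import torch


def memory_report(device=None):
    """Return allocated, reserved, and cached-but-free bytes for a CUDA device."""
    if not torch.cuda.is_available():
        # No GPU: report zeros so the helper stays callable on CPU-only machines.
        return {"allocated": 0, "reserved": 0, "cached_free": 0}
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    return {
        "allocated": allocated,               # bytes held by live tensors
        "reserved": reserved,                 # bytes held by the caching allocator
        "cached_free": reserved - allocated,  # reusable without asking the GPU
    }


report = memory_report()
print(report)
```

Calling the helper before and after `del` and `torch.cuda.empty_cache()` shows which step changes which counter: `del` lowers `allocated` and raises `cached_free`, while `empty_cache()` lowers `reserved`.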

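IV. Memory management in Python

Setting a variable to None in Python does not by itself free memory.

1 What setting a variable to None does

Assigning None to a variable only removes that one Python reference to the tensor. It does not by itself release GPU memory: the tensor's memory is returned to the allocator only when no references to it remain.

2 Why GPU memory may not be released

Even after a variable is set to None, the memory is not released if the tensor is still referenced somewhere else, for example by another variable, by the computation graph, or by a cache. In deep learning frameworks, tensors typically stay referenced by the computation graph until the backward pass or other pending operations have completed.

Example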

import torch

# Use the GPU as the device
device = torch.device('cuda')

# Step 1: allocate memory
tensor1 = torch.randn(1024, 1024, device=device)  # 1024 * 1024 float32 values = 4 MB
tensor2 = torch.randn(2048, 2048, device=device)  # 2048 * 2048 float32 values = 16 MB

# Print allocated and reserved memory
print("Before empty_cache:")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

tensor3 = tensor1  # second reference to the same 4 MB tensor, no new memory
tensor4 = tensor2  # second reference to the same 16 MB tensor, no new memory
print("Before empty_cache3:")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

# Step 2: delete the name tensor1 (tensor3 still refers to the data)
del tensor1

# Print allocated and reserved memory (nothing is freed yet)
print("\nAfter deleting tensor1 (without empty_cache):")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

# Delete the name tensor2 (tensor4 still refers to the data)
del tensor2

# Step 3: release the cache
torch.cuda.empty_cache()

# Print allocated and reserved memory (unchanged, because both tensors are still referenced)
print("\nAfter calling empty_cache:")
print(f"Allocated Memory: {torch.cuda.memory_allocated()} bytes")
print(f"Cached Memory: {torch.cuda.memory_reserved()} bytes")

Output

Before empty_cache:
Allocated Memory: 20971520 bytes
Cached Memory: 20971520 bytes
Before empty_cache3:
Allocated Memory: 20971520 bytes
Cached Memory: 20971520 bytes

After deleting tensor1 (without empty_cache):
Allocated Memory: 20971520 bytes
Cached Memory: 20971520 bytes

After calling empty_cache:
Allocated Memory: 20971520 bytes
Cached Memory: 20971520 bytes

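No memory was released at any step, because tensor3 and tensor4 still refer to the same data as tensor1 and tensor2.
The memory behind tensor1 is released only when every name that refers to it is removed, that is, by deleting both

# Step 2: release tensor1 and its alias
del tensor1
del tensor3

or

tensor1 = None
tensor3 = None

Only then is the memory occupied by that tensor released; the same holds for tensor2 and tensor4.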
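
The same rule can be observed without a GPU. The following sketch uses a plain Python object as a stand-in for a tensor and a weak reference to check whether the object is still alive; the class `Block` and the variable names are hypothetical, and the immediate release relies on CPython's reference counting.

```python
import weakref


class Block:
    """Stand-in for a tensor: any object whose lifetime we want to observe."""


first = Block()
alias = first               # second name for the same object
probe = weakref.ref(first)  # observes the object without keeping it alive

del first                   # one reference removed, `alias` remains
alive_after_one_del = probe() is not None

alias = None                # last reference removed
alive_after_all_refs_gone = probe() is not None

print(alive_after_one_del)        # True
print(alive_after_all_refs_gone)  # False
```

Both `del name` and `name = None` only drop one reference; the object is freed when the last one disappears.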

Original article: https://blog.csdn.net/m0_49448331/article/details/144021865
