pytorch_lightning笔记
Debug
1. 快速运行一次所有的代码 (fast_dev_run)
训练了好长时间但是在训练or 验证的时候崩溃了 使用 fast_dev_run运行5个batch 的 training validation test and predication 查看是否存在错误:
train = Trainer(fast_dev_run=True) # True 时为5
train = Trainer(fast_dev_run=7) # 可以调节为任意int值
2.缩短epoch的长度 (limit_xxx_batch)
有时仅使用training or validation or … 是helpful的 例如在Imagenet等较大的数据集上,比等待complete epoch faster
train = Trainer(limit_train_batch=0.1, limit_val_batch=0.01) # 10% and 1%
train = Trainer(limit_train_batch=10, limit_val_batch=5) # 10 batches and 5 batches
3. 打印输入输出层尺寸(example_input_array)
class LitModel(LightningModule):
def __init__(self, *args, **kwargs):
self.example_input_array = torch.Tensor(32, 1, 28, 28)
summary table 将会输出包括 input and output 的 dimensions
| Name | Type | Params | Mode | In sizes | Out sizes
----------------------------------------------------------------------
0 | net | Sequential | 132 K | train | [10, 256] | [10, 512]
1 | net.0 | Linear | 131 K | train | [10, 256] | [10, 512]
2 | net.1 | BatchNorm1d | 1.0 K | train | [10, 512] | [10, 512]
发现 bottlenecks (profiler)
1. 查看时间(profiler)
trainer = Trainer(profiler="simple") # 测量训练循环中的所有方法
# output for simple
FIT Profiler Report
-------------------------------------------------------------------------------------------
| Action | Mean duration (s) | Total time (s) |
-------------------------------------------------------------------------------------------
| [LightningModule]BoringModel.prepare_data | 10.0001 | 20.00 |
| run_training_epoch | 6.1558 | 6.1558 |
| run_training_batch | 0.0022506 | 0.015754 |
| [LightningModule]BoringModel.optimizer_step | 0.0017477 | 0.012234 |
| [LightningModule]BoringModel.val_dataloader | 0.00024388 | 0.00024388 |
| on_train_batch_start | 0.00014637 | 0.0010246 |
| [LightningModule]BoringModel.teardown | 2.15e-06 | 2.15e-06 |
| [LightningModule]BoringModel.on_train_start | 1.644e-06 | 1.644e-06 |
| [LightningModule]BoringModel.on_train_end | 1.516e-06 | 1.516e-06 |
| [LightningModule]BoringModel.on_fit_end | 1.426e-06 | 1.426e-06 |
| [LightningModule]BoringModel.setup | 1.403e-06 | 1.403e-06 |
| [LightningModule]BoringModel.on_fit_start | 1.226e-06 | 1.226e-06 |
-------------------------------------------------------------------------------------------
trainer = Trainer(profiler="advanced") # 测量每个function的时间
# output for advanced
Profiler Report
Profile stats for: get_train_batch
4869394 function calls (4863767 primitive calls) in 18.893 seconds
Ordered by: cumulative time
List reduced from 76 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
3752/1876 0.011 0.000 18.887 0.010 {built-in method builtins.next}
1876 0.008 0.000 18.877 0.010 dataloader.py:344(__next__)
1876 0.074 0.000 18.869 0.010 dataloader.py:383(_next_data)
1875 0.012 0.000 18.721 0.010 fetch.py:42(fetch)
1875 0.084 0.000 18.290 0.010 fetch.py:44(<listcomp>)
60000 1.759 0.000 18.206 0.000 mnist.py:80(__getitem__)
60000 0.267 0.000 13.022 0.000 transforms.py:68(__call__)
60000 0.182 0.000 7.020 0.000 transforms.py:93(__call__)
60000 1.651 0.000 6.839 0.000 functional.py:42(to_tensor)
60000 0.260 0.000 5.734 0.000 transforms.py:167(__call__)
# 如果探查器报告变得太长,您可以将报告流式传输到文件
from lightning.pytorch.profilers import AdvancedProfiler
profiler = AdvancedProfiler(dirpath=".", filename="perf_logs")
trainer = Trainer(profiler=profiler)
highlevel usage:
https://lightning.ai/docs/pytorch/stable/tuning/profiler_intermediate.html
2. 查看accelerator的使用情况 (DeviceStatsMonitor)
检测瓶颈的另一个有用技术是确保您使用加速器 (GPU/TPU/HPU) 的全部容量。
from lightning.pytorch.callbacks import DeviceStatsMonitor
trainer = Trainer(callbacks=[DeviceStatsMonitor()])
SOTA find
https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html
原文地址:https://blog.csdn.net/Fools_______/article/details/142863664
免责声明:本站文章内容转载自网络资源,如本站内容侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!