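Training on a single card is painfully slow, and you find yourself staring at the other seven 3090s sitting idle. Compute is wealth, so how should a beginner use distributed training to speed things up?
This tutorial uses the DDP (DistributedDataParallel) framework.
If you want a different approach, such as DP or multiprocessing, look for material elsewhere.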

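Introduction

Why use distributed training

The first reason is that the model does not fit on a single GPU, but the full model can run once it is spread across two or more GPUs (as with the early AlexNet).

The second reason is that computing on several GPUs in parallel speeds up training.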

Different approaches

nn.DataParallel: a single process controls multiple GPUs

torch.distributed: faster parallel training

torch.multiprocessing: replaces the launcher

apex: further acceleration
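
For contrast with the DDP approach used in the rest of this tutorial, the first option in the list, nn.DataParallel, can be sketched roughly as follows. The toy nn.Linear model and the tensor shapes are assumptions made for illustration only; the snippet falls back to a single GPU or the CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

# Toy model, purely illustrative (not from the original post).
model = nn.Linear(10, 2)

if torch.cuda.device_count() > 1:
    # One process drives all visible GPUs: each batch is split along dim 0,
    # run on one replica per GPU, and the outputs are gathered on GPU 0.
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

device = next(model.parameters()).device
x = torch.randn(8, 10, device=device)
out = model(x)  # called exactly like a single-GPU model
print(out.shape)
```

Because everything runs in one Python process, this is the simplest option to adopt, but that single process is also why it usually scales worse than one process per GPU.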

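PyTorch provides the torch.distributed.launch launcher, which runs a Python file in distributed fashion from the command line.

torch.distributed.run (the torchrun command) is now the recommended launcher, but the code for many state-of-the-art papers still uses launch, and it still works.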

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=3 main.py
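
Since scripts in the wild are started with either launcher, it can help to read the process index in a way that works for both. The helper below is a minimal sketch, not part of PyTorch: the name get_local_rank is hypothetical. It relies on torch.distributed.launch passing the index as a command-line argument and on torchrun exporting the LOCAL_RANK environment variable.

```python
import argparse
import os


def get_local_rank(argv=None):
    """Return this process's index on the current node (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes the index as a command-line argument;
    # the spelling has varied between --local_rank and --local-rank across
    # PyTorch releases, so both are accepted here.
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank",
                        type=int, default=None)
    # parse_known_args ignores the script's other arguments.
    args, _ = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun sets the LOCAL_RANK environment variable instead.
    return int(os.environ.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print("local rank:", get_local_rank())
```

Run without any launcher, the helper falls back to 0, which keeps single-GPU debugging possible with the same script.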

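Alternatively, control the processes manually with torch.multiprocessing. This sidesteps some minor problems with the way torch.distributed.launch automatically starts and terminates processes.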
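
A minimal sketch of the manual route, assuming a single machine: torch.multiprocessing.spawn starts one worker per GPU (or two CPU workers with the gloo backend when no GPU is available). The names worker and find_free_port, and the all_reduce used as a smoke test, are illustrative choices rather than anything from the original post.

```python
import os
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def find_free_port():
    # Ask the OS for an unused TCP port to use for the rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank, world_size, port, use_cuda):
    # mp.spawn calls worker(i, *args) with i = 0 .. nprocs-1, so the process
    # index arrives as the first argument instead of via a launcher.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    if use_cuda:
        torch.cuda.set_device(rank)
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            rank=rank, world_size=world_size)
    device = torch.device("cuda", rank) if use_cuda else torch.device("cpu")
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t)  # default op is SUM, so every rank gets the sum of ranks
    print(f"rank {rank}: sum of ranks = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    use_cuda = torch.cuda.is_available()
    world_size = torch.cuda.device_count() if use_cuda else 2
    mp.spawn(worker, args=(world_size, find_free_port(), use_cuda),
             nprocs=world_size, join=True)
```

With this layout the script is started with plain python main.py; the number of processes is decided inside the script rather than by --nproc_per_node.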

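The launcher passes the index of the current process to the program

Use init_process_group to set the backend and port used for communication between GPUs

Wrap the dataset

Wrap the model

Optimizer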
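
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the original author's code: the random TensorDataset, the nn.Linear model, the SGD settings, and the fallback environment values are all placeholders. The process index is read from the LOCAL_RANK environment variable, as set by torchrun.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Fallback values (assumptions) so the sketch also runs as a single process
# without a launcher; under torchrun these variables are already set.
for key, value in {"RANK": "0", "WORLD_SIZE": "1", "LOCAL_RANK": "0",
                   "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}.items():
    os.environ.setdefault(key, value)

# 1. The launcher tells each process its index; map it to a device.
local_rank = int(os.environ["LOCAL_RANK"])
use_cuda = torch.cuda.is_available()
if use_cuda:
    torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

# 2. Choose the backend; address and port come from the environment.
dist.init_process_group(backend="nccl" if use_cuda else "gloo",
                        init_method="env://")

# 3. Wrap the dataset: the sampler gives each process a distinct shard.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))  # toy data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

# 4. Wrap the model: DDP averages gradients across processes in backward().
model = nn.Linear(10, 1).to(device)
model = DDP(model, device_ids=[local_rank] if use_cuda else None)

# 5. The optimizer is built on the wrapped model's parameters as usual.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

dist.destroy_process_group()
```

Note that batch_size here is per process, so the effective batch size is batch_size multiplied by the number of processes.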