Problem description
Whenever I use torch.multiprocessing.spawn to parallelize across multiple GPUs I get an error, even with the code sample from the "Parallel and Distributed Training" tutorial.
The example from the PyTorch DDP notes:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()

def main():
    world_size = 2
    mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
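One thing worth checking: dist.init_process_group uses the "env://" rendezvous by default, so every spawned process must see MASTER_ADDR and MASTER_PORT in its environment before that call; if they are missing, initialization fails. The sketch below shows one way to set them, assuming a single machine (the "localhost" address and port "29500" are illustrative values, not from the original post).

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def example(rank, world_size):
    # The default "env://" init method reads these variables in each spawned
    # process. "localhost" / "29500" are assumed values for a single host.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... rest of the DDP example above goes here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)

Setting the variables in the parent process before mp.spawn works as well, since spawned workers inherit the parent's environment.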
Solution
No effective solution to this problem has been found yet; the editor is still searching and collecting answers.
If you have found a good solution, please send it together with a link to this page to the editor.
Editor's email: dio#foxmail.com (replace # with @)