Distributed training hangs right at startup

Problem description

I am using the AllenNLP framework for NLP work. Training works with a single GPU, but as soon as I switch to multiple GPUs it hangs right at startup.

The configuration runs fine on a single GPU.
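The configuration file itself is not shown in this report. As a rough sketch of what the switch usually looks like in AllenNLP 1.x, the single-GPU setup selects a device under `trainer`, while the multi-GPU setup adds a top-level `distributed` block. The device ids below are assumptions chosen to match the two workers reported in the logs, not values taken from the actual config.

```python
import json

# Assumed single-GPU fragment: one device selected on the trainer.
single_gpu = {"trainer": {"cuda_device": 0}}

# Assumed multi-GPU fragment: a top-level "distributed" block listing the
# devices. Two ids correspond to "World size: 2" in out.log.
multi_gpu = {"distributed": {"cuda_devices": [0, 1]}}

# Print both fragments as JSON so they can be compared with the real config.
print("single GPU:", json.dumps(single_gpu))
print("multi GPU: ", json.dumps(multi_gpu))
```

With two entries in `cuda_devices`, AllenNLP spawns one worker process per device, which is why two worker logs appear.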

Environment

using anaconda
ubuntu 16.04

pytorch==1.7.1
allennlp==1.3.0
nvcc -V v10.2.89
driver version: 440.33.01
cuda version: 10.2

I am using 2 × GTX 1080 Ti and an AMD Ryzen 5 1600.
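The versions above come from `nvcc` and the driver. Prebuilt PyTorch packages ship their own CUDA runtime, so it can help to also record what PyTorch itself reports inside the conda environment. The following is a minimal sketch; `describe_gpu_setup` is a hypothetical helper name, and it returns a placeholder when PyTorch is not importable.

```python
def describe_gpu_setup():
    """Collect what PyTorch reports about CUDA, GPUs, and NCCL."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        # Only query NCCL when the distributed package is available at all.
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


for key, value in describe_gpu_setup().items():
    print(f"{key}: {value}")
```

On the machine described here one would expect `gpu_count` to be 2; a different value would point to a device visibility problem rather than a hang inside training.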

The program produces 3 logs: out.log, out_worker0.log, and out_worker1.log.

They are listed below.

# out.log

2020-12-25 14:54:22,558 - INFO - allennlp.common.params - datasets_for_vocab_creation = None
2020-12-25 14:54:22,558 - INFO - allennlp.common.params - dataset_reader.type = my_simple_reader
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.lazy = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.cache_directory = None
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - dataset_reader.manual_multi_process_sharding = False
2020-12-25 14:54:22,559 - INFO - allennlp.common.params - train_data_path = data/train.txt
2020-12-25 14:54:22,559 - INFO - allennlp.training.util - Reading training data from data/train.txt
2020-12-25 14:54:22,561 - INFO - tqdm - reading instances: 0it [00:00, ?it/s]
2020-12-25 14:54:23,212 - INFO - allennlp.common.params - vocabulary.type = from_instances
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.min_count = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.max_vocab_size = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.non_padded_namespaces = ('*tags', '*labels')
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.pretrained_files = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.only_include_pretrained_words = False
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.tokens_to_add = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.min_pretrained_embeddings = None
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.padding_token = @@PADDING@@
2020-12-25 14:54:23,213 - INFO - allennlp.common.params - vocabulary.oov_token = @@UNKNOWN@@
2020-12-25 14:54:23,213 - INFO - allennlp.data.vocabulary - Fitting token dictionary from dataset.
2020-12-25 14:54:23,214 - INFO - tqdm - building vocab: 0it [00:00, ?it/s]
2020-12-25 14:54:23,214 - INFO - allennlp.training.util - writing the vocabulary to tmp/debugger/vocabulary.
2020-12-25 14:54:23,214 - INFO - allennlp.training.util - done creating vocab
2020-12-25 14:54:23,214 - INFO - root - Switching to distributed training mode since multiple GPUs are configured | Master is at: 127.0.0.1:37039 | Rank of this node: 0 | Number of workers in this node: 2 | Number of nodes: 1 | World size: 2

# out_worker0.log

0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - random_seed = 13370
0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - numpy_seed = 1337
0 | 2020-12-25 14:54:24,863 - INFO - allennlp.common.params - pytorch_seed = 133
0 | 2020-12-25 14:54:24,864 - INFO - allennlp.common.checks - Pytorch version: 1.7.1

# out_worker1.log

1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - random_seed = 13370
1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - numpy_seed = 1337
1 | 2020-12-25 14:54:24,826 - INFO - allennlp.common.params - pytorch_seed = 133
1 | 2020-12-25 14:54:24,827 - INFO - allennlp.common.checks - Pytorch version: 1.7.1

It hung for more than 10 minutes, so I pressed Ctrl-C to interrupt it. The output was as follows:

^CTraceback (most recent call last):
  File "/home/axx/anaconda3/envs/allen-test/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  line 323, in train_model
    nprocs=num_procs,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/axx/anaconda3/envs/allen-test/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
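
The traceback only shows the parent process waiting on its spawned workers, and both worker logs end right after the seed and version lines, so the logs do not say where the workers are blocked. One way to gather more information is to rerun with NCCL logging enabled. The sketch below builds such a launch; `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, while `build_debug_launch`, the config path `config.jsonnet`, and the output directory `tmp/debugger_nccl` are hypothetical names. Disabling peer-to-peer transfers is an experiment to try, not a confirmed fix for this machine.

```python
import os
import shlex


def build_debug_launch(config_path, serialization_dir, disable_p2p=False):
    """Return (env, argv) for rerunning `allennlp train` with NCCL logging."""
    env = dict(os.environ)
    # Ask NCCL to log its initialization and transport choices.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Experiment: route GPU-to-GPU traffic through host memory
        # instead of direct peer-to-peer copies.
        env["NCCL_P2P_DISABLE"] = "1"
    argv = ["allennlp", "train", config_path, "-s", serialization_dir]
    return env, argv


# Hypothetical paths; replace them with the real config and a fresh directory.
env, argv = build_debug_launch("config.jsonnet", "tmp/debugger_nccl", disable_p2p=True)
print(" ".join(shlex.quote(part) for part in argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```

Because the workers are spawned from the parent process, they inherit these variables. If the run then proceeds past the point where it previously stalled, that narrows the problem to GPU peer-to-peer communication.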
