Problem description
I am having problems with Slurm and Open MPI on my cluster. Whenever I run any job that uses mpirun, I get the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
The problem appeared suddenly, and it seems to affect every compute node.
Seemingly related, srun now fails as well, with the following messages:
srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
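Any help would be appreciated!
Edit: adding an example
If I run mpirun hostname on the head node, everything works. However, when I run mpirun hostname inside a Slurm allocation (salloc), I get the ORTE daemon error quoted earlier.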