Slurm MPI error: ORTE daemon failed

Problem description

I'm having trouble with Slurm and Open MPI on our cluster. Whenever I run any job that uses mpirun, I get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

The problem appeared suddenly and seems to affect every compute node.

Possibly related: srun now fails as well, with the following message:

srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Thanks for any help!

Edit: added an example

If I run mpirun hostname on the head node, everything works. Inside a Slurm allocation (salloc), however, running mpirun hostname produces the error.
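
The comparison described in the edit can be captured in a small script, so that the same check is run in both places. This is only a rough sketch: the helper name run_check and the output labels are made up for illustration, and it assumes that SLURM_JOB_ID is set in the shell that salloc starts.

```shell
#!/usr/bin/env bash
# Reproduction sketch: run the same `mpirun hostname` check on the head node
# and inside a Slurm allocation, then compare the two result lines.
# run_check is a hypothetical helper name, not something from the post.

run_check() {
    local label="$1"
    # Bail out early if Open MPI's launcher is not on PATH.
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "$label: mpirun not found"
        return 2
    fi
    # Only the exit status matters here; mpirun's own output is discarded.
    if mpirun hostname >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
        return 1
    fi
}

# Assumption: Slurm exports SLURM_JOB_ID inside an allocation (e.g. under salloc).
# `|| true` keeps a non-zero status from aborting a caller that uses `set -e`.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_check "inside allocation ${SLURM_JOB_ID}" || true
else
    run_check "head node (no allocation)" || true
fi
```

Running it once directly on the head node and once from the shell that salloc opens should show "ok" for the first and "FAILED" for the second if the behaviour matches what is described here.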
