Problem Description
I submitted a job through Slurm. The job ran for 12 hours and was working as expected, and then it failed with "Data unpack would read past end of buffer in file util/show_help.c at line 501".
I often hit errors like "ORTE has lost communication with a remote daemon", but those usually appear at the start of a job; that is annoying, but it does not cost as much time as an error that shows up after 12 hours. Is there a quick way to fix this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
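For reference, the two help texts in the log above already name concrete MCA parameters: btl_openib_allow_ib (the override that keeps InfiniBand ports usable through the openib BTL) and orte_base_help_aggregate (set to 0 to stop suppressing the repeated help/error messages). A minimal, untested sketch of passing both on the mpirun line inside the batch script; the binary name ./my_app is a placeholder:

    # Show every help message and allow the openib BTL on the IB ports,
    # as the two help texts above suggest. ./my_app is a placeholder.
    mpirun --mca orte_base_help_aggregate 0 \
           --mca btl_openib_allow_ib true \
           ./my_app

The same parameters can usually also be exported as environment variables (OMPI_MCA_orte_base_help_aggregate=0 and OMPI_MCA_btl_openib_allow_ib=true) in the sbatch script instead of being put on the command line.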
Solution
No effective solution to this problem has been found yet; the editor is still searching and compiling one.
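That said, the first help block in the log itself points at a direction: for Open MPI 4.0 and later the intent is to use UCX, not the legacy openib BTL, for InfiniBand devices. A minimal, unverified sketch of forcing that path, assuming this Open MPI 4.0.1 build was compiled with UCX support (./my_app is again a placeholder):

    # Route MPI traffic through the UCX PML and exclude the openib BTL.
    # Unverified against the 12-hour failure described above.
    mpirun --mca pml ucx --mca btl ^openib ./my_app

The ^openib syntax excludes the openib BTL component so that only UCX drives the InfiniBand ports; whether this also avoids the show_help.c unpack error after 12 hours has not been confirmed here.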