Problem Description
I submitted a job through Slurm. The job ran for 12 hours and was working as expected, and then it failed with "Data unpack would read past end of buffer in file util/show_help.c at line 501".
I often hit errors like "ORTE has lost communication with a remote daemon", but those usually appear at the start of a job; that is annoying, but it does not cost as much time as an error that shows up after 12 hours. Is there a quick way to fix this? The Open MPI version is 4.0.1.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: barbun40
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: barbun40
Local device: mlx5_0
--------------------------------------------------------------------------
[barbun21.yonetim:48390] [[15284,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[barbun21.yonetim:48390] 127 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[barbun21.yonetim:48390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[barbun21.yonetim:48390] 126 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: barbun64
Local PID: 252415
Peer host: barbun39
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15284,1],35]
Exit code: 9
--------------------------------------------------------------------------
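For reference, the two help texts in the log above already name concrete MCA parameters: btl_openib_allow_ib (the override that keeps InfiniBand ports usable through the openib BTL) and orte_base_help_aggregate (set to 0 to stop suppressing the repeated help/error messages). A minimal, untested sketch of passing both on the mpirun line inside the batch script; the binary name ./my_app is a placeholder:

    # Show every help message and allow the openib BTL on the IB ports,
    # as the two help texts above suggest. ./my_app is a placeholder.
    mpirun --mca orte_base_help_aggregate 0 \
           --mca btl_openib_allow_ib true \
           ./my_app

The same parameters can usually also be exported as environment variables (OMPI_MCA_orte_base_help_aggregate=0 and OMPI_MCA_btl_openib_allow_ib=true) in the sbatch script instead of being put on the command line.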
Solution
No effective solution to this problem has been found yet; the editor is still searching and compiling one.
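That said, the first help block in the log itself points at a direction: for Open MPI 4.0 and later the intent is to use UCX, not the legacy openib BTL, for InfiniBand devices. A minimal, unverified sketch of forcing that path, assuming this Open MPI 4.0.1 build was compiled with UCX support (./my_app is again a placeholder):

    # Route MPI traffic through the UCX PML and exclude the openib BTL.
    # Unverified against the 12-hour failure described above.
    mpirun --mca pml ucx --mca btl ^openib ./my_app

The ^openib syntax excludes the openib BTL component so that only UCX drives the InfiniBand ports; whether this also avoids the show_help.c unpack error after 12 hours has not been confirmed here.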