Problem description
I have a WSL2 "machine", wsl001, and a real Linux machine, linux002. I noticed that I cannot even run a simple mpirun --host linux hostname, as suggested in https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems:
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: wsl001
Remote host: linux002
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
I think the last point is the problem, because ssh linux002 mpirun hostname works fine.
With the --mca plm_base_verbose 10 flag, I noticed this line:
[wsl001:18696] [[11212,0],0] plm:rsh: final template argv:
/usr/sbin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "734789632" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "wsl[3:48],linux[3:1]@0(2)" -mca orte_hnp_uri "734789632.0;tcp://172.17.45.213:42213" --mca plm_base_verbose "10" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "734789632.0;tcp://172.17.45.213:42213" -mca pmix "^s1,s2,cray,isolated"
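That line advertises the WSL-internal NAT IP 172.17.45.213 instead of the external IP. Of course WSL2 networking would cause problems here. As the Open MPI FAQ puts it, "Open MPI opens random TCP, and sometimes random UDP, ports between hosts in a single MPI job", so I cannot simply forward specific ports from the Windows host to the WSL machine, and it is not clear to me how SSH tunneling would help. Because the WSL machine's internal IP does not stay constant, I cannot even set up permanent forwarding for the SSH port (and the Windows host reserves port 22 for its own SSHD instance, even when that instance is not in use).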
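To make the diagnosis concrete, here is a minimal shell sketch that pulls the advertised address out of the orte_hnp_uri value in the log line above and checks whether it falls in a private (RFC 1918) range. The helper name is_rfc1918 is hypothetical, and the sketch assumes the URI carries a single tcp:// endpoint, as it does in this log.

```shell
#!/bin/sh
# Value of orte_hnp_uri, copied from the plm:rsh template line in the log.
hnp_uri='734789632.0;tcp://172.17.45.213:42213'

# Drop everything up to "tcp://", then split the endpoint at the colon.
# Assumption: exactly one address is listed after tcp://.
endpoint=${hnp_uri#*tcp://}
hnp_ip=${endpoint%%:*}
hnp_port=${endpoint##*:}

# Hypothetical helper: succeeds if the IPv4 address is in an RFC 1918 range
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
is_rfc1918() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_rfc1918 "$hnp_ip"; then
    echo "$hnp_ip:$hnp_port is a private address; linux002 needs a route to it"
else
    echo "$hnp_ip:$hnp_port is a public address"
fi
```

If the address is private and linux002 has no route into the WSL2 NAT network, the remote orted cannot connect back to mpirun, which would match the TCP connection error shown earlier.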
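Are there any other options for getting a WSL2 machine to work in an Open MPI environment? Would it be enough to make SSH work in the other direction as well? Or would the WSL NAT still break the port forwarding?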
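One partial option that may be worth exploring is restricting the ports Open MPI uses, so that a fixed range could be forwarded. The sketch below only prints a candidate command line and does not run it. The MCA parameter names (oob_tcp_dynamic_ipv4_ports, btl_tcp_port_min_v4, btl_tcp_port_range_v4) are what I recall from Open MPI 4.x and should be checked against the output of ompi_info on the installed version; the port numbers and the function name build_mpirun_cmd are arbitrary assumptions. This has not been verified in the WSL2 setup described here, and it does not change the address that mpirun advertises.

```shell
#!/bin/sh
# Hypothetical port choices; any free ranges that can be forwarded would do.
OOB_PORT_MIN=46000   # first port for the runtime (out-of-band) channel
BTL_PORT_MIN=46100   # first port for MPI point-to-point TCP traffic
PORT_COUNT=16        # number of ports in each range

# Print (do not execute) an mpirun command line with pinned port ranges.
# $1 is the remote host name.
build_mpirun_cmd() {
    oob_port_max=$((OOB_PORT_MIN + PORT_COUNT - 1))
    printf 'mpirun --host %s' "$1"
    # Assumed Open MPI 4.x parameter names; verify with ompi_info.
    printf ' --mca oob_tcp_dynamic_ipv4_ports %s-%s' "$OOB_PORT_MIN" "$oob_port_max"
    printf ' --mca btl_tcp_port_min_v4 %s' "$BTL_PORT_MIN"
    printf ' --mca btl_tcp_port_range_v4 %s' "$PORT_COUNT"
    printf ' hostname\n'
}

build_mpirun_cmd linux002
```

Even with fixed ports, the remote daemon would still be told to connect to the internal NAT address, so port forwarding alone would probably not be enough without also making that address reachable from linux002.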