How can I run multi-node Open MPI through WSL2's NAT?

Problem description

Using one WSL2 "machine", wsl001, and one real Linux machine, linux002, I noticed that I cannot even run the simple mpirun --host linux hostname test described at https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems:

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    wsl001
  Remote host:   linux002
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

I think the last point is the problem, because ssh linux002 mpirun hostname works fine.

With the --mca plm_base_verbose 10 flag, I noticed this line:

[wsl001:18696] [[11212,0],0] plm:rsh: final template argv:
    /usr/sbin/ssh <template>  orted -mca ess "env" -mca ess_base_jobid "734789632" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "wsl[3:48],linux[3:1]@0(2)" -mca orte_hnp_uri "734789632.0;tcp://172.17.45.213:42213" --mca plm_base_verbose "10" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "734789632.0;tcp://172.17.45.213:42213" -mca pmix "^s1,s2,cray,isolated"

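It uses the WSL-internal NAT IP 172.17.45.213 instead of the external IP. Yes, of course WSL2 networking causes trouble... As the Open MPI FAQ states, "Open MPI opens random TCP, and sometimes random UDP, ports between hosts in a single MPI job", so I cannot simply forward specific ports from the Windows host to the WSL machine, and it is not clear to me how an SSH tunnel would help here. Because the WSL machine's internal IP does not stay constant, I cannot even set up a permanent forward for the SSH port (on top of that, the Windows host blocks port 22 for its own SSHD instance, even when that instance is not in use).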
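To double-check which address mpirun is actually advertising to the remote daemon, a small sketch like the following can pull the address out of the orte_hnp_uri value in the verbose output and test whether it falls in a private range. The helper name hnp_address is hypothetical, and the regular expression assumes the URI carries a single IPv4 address in the form shown above; it is not an Open MPI API.

```python
import ipaddress
import re


def hnp_address(argv_line: str):
    """Return (ip, port) advertised in -mca orte_hnp_uri, or None if absent.

    Assumption: the URI looks like "<jobid>;tcp://<ipv4>:<port>", as in the
    log line quoted in this question (one IPv4 address, no comma list).
    """
    match = re.search(r'orte_hnp_uri\s+"[^";]*;tcp://([0-9.]+):(\d+)"', argv_line)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


# Fragment copied from the plm:rsh "final template argv" line.
line = '-mca orte_hnp_uri "734789632.0;tcp://172.17.45.213:42213" --mca plm_base_verbose "10"'

addr, port = hnp_address(line)
# 172.17.45.213 lies in 172.16.0.0/12, so it is reported as private.
print(addr, port, "private" if addr.is_private else "public")
```

For the line above this reports a private address, which is consistent with linux002 being unable to connect back to mpirun on wsl001.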

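Is there any other option to make a WSL2 machine work properly in an Open MPI environment? Would it be enough to make SSH work in the other direction as well? Or would the WSL NAT still mess up the port forwarding?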
