

Using a WSL2 "machine" wsl001 and a real Linux machine linux002, I noticed that I can't even run the basic check from https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems, mpirun --host linux hostname. It fails with:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    wsl001
  Remote host:   linux002
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

I think the last bullet is the actual problem, because ssh linux002 mpirun hostname works fine.

With the --mca plm_base_verbose 10 flag, I noticed this line:

[wsl001:18696] [[11212,0],0] plm:rsh: final template argv:
    /usr/sbin/ssh <template>  orted -mca ess "env" -mca ess_base_jobid "734789632" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "wsl[3:48],linux[3:1]@0(2)" -mca orte_hnp_uri "734789632.0;tcp://" --mca plm_base_verbose "10" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "734789632.0;tcp://" -mca pmix "^s1,s2,cray,isolated"

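The WSL-internal NAT IP is being advertised instead of the external IP. Yes, of course, WSL2 networking causes problems... As the Open MPI FAQ puts it, "Open MPI opens random TCP, and sometimes random UDP, ports between hosts in a single MPI job", so I can't simply forward a specific port from the Windows host to the WSL machine, and it's not clear to me how an SSH tunnel would help here either.

Since the WSL machine's internal IP doesn't stay constant, I can't even set up a permanent forward for the SSH port (on top of that, the Windows host reserves port 22 for its own SSHD instance, even when that instance isn't in use).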
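For anyone who wants to confirm the same mismatch, here is a rough sketch of how the advertised address could be compared with the local one. The function names hnp_uri_addrs and compare_addrs are made up for illustration, eth0 is only assumed to be the WSL2 interface, and the URI is assumed to look like jobid.vpid;tcp://addr[,addr...]:port.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read verbose mpirun output on stdin and print the
# address part of any "tcp://addr[,addr...]:port" URI found on a line.
hnp_uri_addrs() {
  # Drop everything up to "tcp://", keep the text up to the next ':' or '"'.
  sed -n 's|.*tcp://\([^:"]*\).*|\1|p'
}

# Hypothetical helper: show the local interface address next to the
# address mpirun advertises to the remote daemon.
# $1 = remote host, $2 = interface name (assumed to be eth0 under WSL2).
compare_addrs() {
  local host="$1" iface="${2:-eth0}"
  echo "local address(es) on ${iface} (with prefix length):"
  ip -4 -o addr show dev "${iface}" | awk '{print $4}'
  echo "address(es) advertised in orte_hnp_uri:"
  mpirun --mca plm_base_verbose 10 --host "${host}" hostname 2>&1 \
    | grep 'orte_hnp_uri' | hnp_uri_addrs
}
```

Running compare_addrs with the remote host name as its argument from inside WSL2 should print both lists. If the advertised address is in the NAT range rather than the Windows host's LAN address, the remote orted has no route back to mpirun, which matches the last bullet of the error message.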





