[AMBER] Problem using Amber with MPI

From: Feng Su <feng.su.tandemai.com>
Date: Thu, 31 Mar 2022 05:04:59 +0000

Hi Team,
This is Feng, a user of Amber.

I ran into a problem when using Amber with MPI on our Slurm GPU cluster.
We deployed the Docker service on our compute nodes.
The new docker0 interface was added ahead of the default interface ens3, and this caused the failure.

-----------------------------------------------------------------
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        inet6 fe80::42:75ff:fecd:fbaa prefixlen 64 scopeid 0x20<link>
...
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.100.16 netmask 255.255.255.0 broadcast 192.168.100.255
        inet6 fe80::f652:14ff:fe89:24d0 prefixlen 64 scopeid 0x20<link>
...
-----------------------------------------------------------------

When we disable the docker0 interface, everything goes back to normal.
It seems Amber (via the underlying Open MPI) uses "172.17.0.1" as the communication IP instead of "192.168.100.16".
Could you confirm whether this is a bug, and how can we fix the problem?
I hope to get your reply.
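
A possible workaround (a sketch based on Open MPI's standard MCA interface-selection parameters, not verified on our cluster beyond disabling docker0; the interface names below are taken from our nodes and would need adjusting elsewhere) would be to exclude docker0 from both the MPI data channel and the runtime's out-of-band channel:

-----------------------------------------------------------------
# Exclude the Docker bridge from the TCP BTL used for MPI traffic and
# from the out-of-band TCP channel used by the runtime. Note that
# setting btl_tcp_if_exclude overrides the default exclusion list, so
# loopback (lo) must be excluded explicitly as well.
mpirun -mca btl ^openib \
       -mca btl_tcp_if_exclude docker0,lo \
       -mca oob_tcp_if_exclude docker0,lo \
       -np 14 --timeout 86400 \
       pmemd.cuda_SPFP.MPI -ng 14 -groupfile groupfile -rem 3

# Alternatively, restrict both channels to the known-good interface:
#   -mca btl_tcp_if_include ens3 -mca oob_tcp_if_include ens3
-----------------------------------------------------------------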


Commands and reference log:
--------------------------------------------------------------------------
mpirun -mca btl ^openib -np 14 --timeout 86400 pmemd.cuda_SPFP.MPI -ng 14 -groupfile groupfile -rem 3
----------------------------------output for above commands----------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: CM024
Local PID: 33360
Peer hostname: CM024 ([[25346,1],0])
Source IP of socket: 192.168.100.24
Known IPs of peer:
--------------------------------------------------------------------------
[CM024][[25346,1],0][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[25346,1],1]
[CM024][[25346,1],1][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[25346,1],3]
[CM024][[25346,1],3][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[25346,1],0]
[CM024][[25346,1],1][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[25346,1],0]
[CM024:33347] 2 more processes have sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[CM024:33347] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
mpirun: Forwarding signal 18 to job
--------------------------------------------------------------------------
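
For reference, ompi_info (standard Open MPI tooling) should be able to show the TCP BTL's interface-selection parameters and their current values on a compute node, which may help diagnose which interfaces are being picked up:

-----------------------------------------------------------------
# List the TCP BTL's interface include/exclude parameters and their
# current (default) values:
ompi_info --param btl tcp --level 9 | grep tcp_if
-----------------------------------------------------------------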


--
Best Regards,
Su Feng
--------------------------------------------
Tel: +86 186 5107 0620
E-Mail: feng.su.tandemai.com
Address: Room 2102-2103, Block C, Suzhou Center,
Suzhou Industry Park, Jiangsu, China

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber