[AMBER] mpirun <defunct> with pmemd.cuda.MPI on EXXACT systems

From: Jardin, Christophe Dr. <Christophe.Jardin.klinikum-nuernberg.de>
Date: Tue, 9 Jul 2019 13:57:48 +0000

Dear all,

This is surely not a problem with AMBER itself but rather an MPI problem; however, since it happens to me when using Amber on EXXACT 'AMBER Certified MD Systems', I hope to find someone here who can help me.

I'm running constant pH replica exchange simulations using either Amber16 or Amber18.
According to the output files, the simulations seem to run properly. However, each time a simulation reaches its end, the mpirun process becomes <defunct> and the job does not terminate properly.

I start the simulations as follows:
>unset CUDA_VISIBLE_DEVICES
>export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
(This is on an EXXACT rack system with 8 GPUs that all communicate with each other according to 'gpuP2PCheck'; Amber16 is installed on this system. I also tested with export CUDA_VISIBLE_DEVICES=0,1 and with export CUDA_VISIBLE_DEVICES=0 only, but that did not solve the problem.)
(I made similar tests with Amber18 on an EXXACT workstation with 4 GPUs, where GPUs 0 and 1, and GPUs 2 and 3, communicate with each other according to 'gpuP2PCheck'; there I tried both export CUDA_VISIBLE_DEVICES=0,1,2,3 and export CUDA_VISIBLE_DEVICES=0,1, but I always get the same failure.)
I then launch the simulations with:
>./MY_JOB &
where MY_JOB contains the command 'mpirun -np 16 pmemd.cuda.MPI -ng 16 -groupfile groupfile -rem 4 -remlog Prod.log'. A minimal sketch of MY_JOB is given below.
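For reference, this is roughly how the pieces above fit together as a single script (the GPU list matches the 8-GPU rack system; the groupfile and log names are the ones from my command line):

    #!/bin/bash
    # Expose all 8 GPUs of the rack system to the run
    unset CUDA_VISIBLE_DEVICES
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

    # 16 replicas, one MPI rank per replica, constant-pH REMD (-rem 4)
    mpirun -np 16 pmemd.cuda.MPI -ng 16 -groupfile groupfile -rem 4 -remlog Prod.log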

The jobs start properly, and I get:
Running multipmemd version of pmemd Amber 16
   Total processors = 16
   Number of groups = 16

The logfile.XXX files (XXX = 0...15) are created and written by the end of the simulation. The output file Prod.log is created and updated with the newly attempted exchanges as the simulation runs, until the last exchange is reached. All other files for each replica (cpin, cpout, mdinfo, mdout, nc, rst7, ...) are also created and written properly.

According to the command 'ps', as long as the job is running there is one process for MY_JOB, one for mpirun, one for hydra_pmi_proxy, and 16 processes for the different pmemd.cuda.MPI instances.
Once the maximum number of exchanges is reached, the pmemd.cuda.MPI and hydra_pmi_proxy processes disappear (they are no longer listed by 'ps'), but the mpirun process becomes <defunct> and MY_JOB has status 'T' according to 'top'.
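For completeness, the process states can be checked with something like:
>ps -o pid,ppid,stat,cmd -C mpirun,pmemd.cuda.MPI
(in the STAT column, 'Z' marks a defunct/zombie process and 'T' a stopped one; after the run ends, only the mpirun line remains, with state 'Z').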

This is especially inconvenient because I would like to run the simulation as several shorter steps rather than one long run, with a script that automatically starts the next step once the previous one has finished (see the sketch below). Since each step does not terminate properly, the next step is never started automatically.
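For illustration, the kind of driver script I have in mind looks like this (the step-numbered groupfile and log names are placeholders); it obviously relies on each mpirun call actually returning:

    #!/bin/bash
    # Run the production in consecutive shorter steps
    for step in 1 2 3 4 5; do
        mpirun -np 16 pmemd.cuda.MPI -ng 16 -groupfile groupfile.$step -rem 4 -remlog Prod.$step.log || exit 1
    done

Because each step never terminates cleanly, the loop never advances to the next step.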

Note that I observe the same problem when using sander.MPI.

The MPI distribution used is MPICH 3.1.4.
Running mpirun in verbose mode, the last output consists of the two lines below, repeated 16 times, with XX taking a different value each time:
[proxy:0:0.c103074] got pmi command (from XX): finalize
[proxy:0:0.c103074] PMI response: cmd=finalize_ack


Has anyone encountered the same or a similar problem?
Or does anyone have a suggestion as to what the problem could be and how to solve it?

Thanks a lot in advance!
Christophe



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 09 2019 - 07:00:05 PDT