[AMBER] pmemd.cuda.MPI on Comet - MPI dying

From: Kenneth Huang <kennethneltharion.gmail.com>
Date: Wed, 21 Oct 2015 12:30:41 -0400

Dear all,

I'm running into a very strange error when running two systems with
pmemd.cuda.MPI on SDSC's Comet. Specifically, I'm running both jobs in
parallel on the GPU nodes, so each job gets one K80, i.e. two GPUs.

However, one of the jobs runs without a problem, while the second one
seems to die or hang right at the start.

The associated error message in the output file is-

[comet-31-05.sdsc.edu:mpi_rank_1][dreg_register] [src/mpid/ch3/channels/common/src/reg_cache/dreg.c:1024] cuda failed with 500
[comet-31-05.sdsc.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[comet-31-05.sdsc.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[comet-31-05.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 74194) exited with status 1
[comet-31-05.sdsc.edu:mpispawn_0][report_error] connect() failed: Connection refused (111)

This initially made me think something was wrong with my input or the
restart file, since I've run these systems without problems before. But
when I checked them, I couldn't find any issues, so I tested the system
with short runs on Comet and locally as a serial job, and those ran
without any errors. Yet when I go back to the same two-job setup, it
reproduces the same behavior, with one job running and the other failing.
What's even more bizarre is that I can reproduce the behavior across
different nodes, and that another pair of similar systems has no issues
when running with a functionally identical setup.
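
For what it's worth, the short serial test was along the following lines;
the &cntrl values below are only illustrative, but the topology and restart
files are the same ones used in production:

# illustrative short serial check (input values are placeholders, not the real ones)
cat > short_test.in << 'EOF'
Short serial sanity check
 &cntrl
   imin=0, ntx=5, irest=1,
   nstlim=1000, dt=0.002,
   ntt=3, gamma_ln=2.0, temp0=300.0,
   ntc=2, ntf=2, cut=8.0,
   ntpr=100, ntwx=0, ntwr=1000,
 /
EOF

/share/apps/gpu/amber/pmemd.cuda -O -i short_test.in -o short_test.out \
    -p A.prmtop -c 05A_preprod2.rst -r short_test.rst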

Based on that, my question is whether this is a very strange bug or
performance issue, or whether it's a problem with MPI somehow dying or
the node running out of resources?
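
In case it helps with the resource question, I could add something like the
following just before the launch lines to record what the node's GPUs look
like when the jobs start (just a sketch; the query fields are standard
nvidia-smi ones):

# report the node name and the state of all GPUs before either job launches
hostname
nvidia-smi --query-gpu=index,name,compute_mode,memory.used --format=csv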

For reference, the running part of the submission script is-

export SLURM_NODEFILE=`generate_pbs_nodefile`

mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o 06A_prod1.out \
    -p A.prmtop -c 05A_preprod2.rst -r 06A_prod1.rst -x 06A_prod1.nc \
    -inf 06A_prod1.mdinfo -l 06A_prod1.log &

mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o 06B_prod1.out \
    -p B.prmtop -c 06B_preprod2.rst -r 06B_prod1.rst -x 06B_prod1.nc \
    -inf 06B_prod1.mdinfo -l 06B_prod1.log &

wait
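
If it does turn out to be a resource problem, I was also thinking of logging
GPU usage alongside the two jobs, roughly like this (a sketch; gpu_usage.log
is just a name I made up):

# background loop recording per-GPU utilization and memory every 30 s
( while sleep 30; do
      date
      nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader
  done ) > gpu_usage.log 2>&1 &
MONITOR_PID=$!

# ... the two mpirun_rsh launches and 'wait' go here ...

kill $MONITOR_PID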


Best,

Kenneth
-- 
Ask yourselves, all of you, what power would hell have if those imprisoned
here could not dream of heaven?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Oct 21 2015 - 10:00:03 PDT