Hi Kenneth,
To confirm, the inputs are identical here, yes?
As in, the contents of the 06A and 06B input files are identical?
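A quick diff on Comet would confirm it, something along these lines (just a sketch, using the mdin filenames from your script below):

    diff 06A_prod.in 06B_prod.in

If that prints nothing, we can rule the input files out.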
And I assume the space in your command line here:
> /share/apps/gpu/amber/pmemd.c uda.MPI -O -i 06B_prod.in -o 06B_prod1.out
is an email typo and not a real typo in your script?
And is it always the same job that fails? What if you swap them and run 06B on GPUs 0,1 and 06A on GPUs 2,3?
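Something like the following would test that; it is just your two launch lines from the script below with the CUDA_VISIBLE_DEVICES and MV2_CPU_MAPPING values swapped (and assuming the space in pmemd.cuda.MPI really is only an email artifact):

    export SLURM_NODEFILE=`generate_pbs_nodefile`

    # 06B on the first K80 (GPUs 0,1), where 06A currently runs
    mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
        MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
        /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o 06B_prod1.out \
        -p B.prmtop -c 06B_preprod2.rst -r 06B_prod1.rst -x 06B_prod1.nc \
        -inf 06B_prod1.mdinfo -l 06B_prod1.log &

    # 06A on the second K80 (GPUs 2,3)
    mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
        MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
        /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o 06A_prod1.out \
        -p A.prmtop -c 05A_preprod2.rst -r 06A_prod1.rst -x 06A_prod1.nc \
        -inf 06A_prod1.mdinfo -l 06A_prod1.log &

    wait

If the failure follows the 06B job onto GPUs 0,1, that points at that system's inputs; if it stays on GPUs 2,3, it points more at the second GPU pair or the second launch.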
All the best
Ross
> On Oct 21, 2015, at 09:30, Kenneth Huang <kennethneltharion.gmail.com> wrote:
>
> Dear all,
>
> I'm running into a very strange error when running two systems with
> pmemd.cuda.MPI on SDSC's Comet. Specifically, I'm running both jobs in
> parallel on the GPU nodes, i.e. one K80 (two GPUs) per job.
>
> However, what's happening is that one of the jobs runs without a problem,
> but the second one seems to die or hang right at the start of the job.
>
> The associated error message in the output file is-
>
> [comet-31-05.sdsc.edu:mpi_rank_1][dreg_register] [src/mpid/ch3/channels/common/src/reg_cache/dreg.c:1024] cuda failed with 500
> [comet-31-05.sdsc.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [comet-31-05.sdsc.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [comet-31-05.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 74194) exited with status 1
> [comet-31-05.sdsc.edu:mpispawn_0][report_error] connect() failed: Connection refused (111)
>
> That initially made me think there was something wrong with my input or
> the restart file, since I've run these systems without problems before. But
> when I checked them, I couldn't find any issues, so I tested the system with
> short runs on Comet and locally as a serial job, and it ran without any
> errors. Yet when I run it with the same parallel setup, it reproduces the
> same behavior, with one job running and the other failing. What's even more
> bizarre is that I can reproduce the behavior across different nodes, and
> that another pair of similar systems has no issues when running with a
> functionally identical setup.
>
> Based on that, my question is whether this is a very strange bug or a
> performance error, or whether it's a problem with MPI somehow dying or
> running out of resources?
>
> For reference, the running part of the submission script is-
>
> export SLURM_NODEFILE=`generate_pbs_nodefile`
>
> mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1
> MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1
> /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o 06A_prod1.out
> -p A.prmtop -c 05A_preprod2.rst -r 06A_prod1.rst -x 06A_prod1.nc
> -inf 06A_prod1.mdinfo -l 06A_prod1.log &
>
> mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1
> MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3
> /share/apps/gpu/amber/pmemd.c uda.MPI -O -i 06B_prod.in -o 06B_prod1.out
> -p B.prmtop -c 06B_preprod2.rst -r 06B_prod1.rst -x 06B_prod1.nc
> -inf 06B_prod1.mdinfo -l 06B_prod1.log &
>
> wait
>
>
> Best,
>
> Kenneth
> --
> Ask yourselves, all of you, what power would hell have if those imprisoned
> here could not dream of heaven?
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Oct 21 2015 - 14:30:03 PDT