Re: [AMBER] pmemd.cuda.MPI on Comet- MPI dying

From: Kenneth Huang <kennethneltharion.gmail.com>
Date: Thu, 22 Oct 2015 10:08:11 -0400

Hi Ross,

Yes, both the inputs and the systems themselves are almost identical: 06B has
a ligand that 06A doesn't, so the only difference between the inputs is the
NMR restraint file they refer to.

Right, that's just a typo in my email.

All three times I tested it, it failed with the same error message for the
first job (06A), which made me think there was something specifically wrong
with that job or the files associated with it; that turned out not to be the
case. I haven't tested them in reverse yet, but I just submitted a job
putting 06B first and 06A on the second pair of GPUs, so I'll see what
happens (roughly as sketched below).
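
For reference, the swapped submission looks roughly like this. It's just the
original script from below with the two commands exchanged, so 06B gets GPUs
0,1 and 06A gets GPUs 2,3; whether the MV2_CPU_MAPPING values should follow
the GPUs or stay where they were is my assumption here.

export SLURM_NODEFILE=`generate_pbs_nodefile`

# 06B on the first pair of GPUs (assumed CPU mapping 0:2)
mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 \
    MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o 06B_prod1.out \
    -p B.prmtop -c 06B_preprod2.rst -r 06B_prod1.rst -x 06B_prod1.nc \
    -inf 06B_prod1.mdinfo -l 06B_prod1.log &

# 06A moved to the second pair of GPUs (assumed CPU mapping 1:3)
mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 \
    MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o 06A_prod1.out \
    -p A.prmtop -c 05A_preprod2.rst -r 06A_prod1.rst -x 06A_prod1.nc \
    -inf 06A_prod1.mdinfo -l 06A_prod1.log &

wait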

Best,

Kenneth

On Wed, Oct 21, 2015 at 5:14 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Kenneth,
>
> To confirm the inputs are identical here yes?
>
> As in the contents of the 06A and 06B files are identical?
>
> And I assume the space in your command line here:
>
>
> > /share/apps/gpu/amber/pmemd.c uda.MPI -O -i 06B_prod.in -o 06B_prod1.out
>
> Is an email typo and not a real typo in your script?
>
> And is it always the same job that fails? What if you run 06B on GPUs 0,1
> and 06A on 2,3?
>
> All the best
> Ross
>
> > On Oct 21, 2015, at 09:30, Kenneth Huang <kennethneltharion.gmail.com>
> wrote:
> >
> > Dear all,
> >
> > I'm running into a very strange error when running two systems with
> > pmemd.cuda.MPI on SDSC's Comet. Specifically, I'm running both jobs in
> > parallel on the GPU nodes, so one K80 (two GPUs) per job.
> >
> > However, one of the jobs runs without a problem, while the second one
> > seems to die or hang at the start of the job.
> >
> > The associated error message in the output file is-
> >
> > [comet-31-05.sdsc.edu:mpi_rank_1][dreg_register] [src/mpid/ch3/channels/common/src/reg_cache/dreg.c:1024] cuda failed with 500
> > [comet-31-05.sdsc.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > [comet-31-05.sdsc.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [comet-31-05.sdsc.edu:mpispawn_0][child_handler] MPI process (rank: 1, pid: 74194) exited with status 1
> > [comet-31-05.sdsc.edu:mpispawn_0][report_error] connect() failed: Connection refused (111)
> >
> > That made me initially think there was something wrong with my input or
> > the restart file, since I've run these systems without problems before.
> > But when I checked them I couldn't find any issues, so I tested short runs
> > on Comet and locally as a serial job, which ran without any errors. Yet
> > when I try to run it with the same setup, it reproduces the same behavior:
> > one job runs and the other fails. What's even more bizarre is that I can
> > reproduce the behavior across different nodes, and that another pair of
> > similar systems doesn't have any issues when running with a functionally
> > identical setup.
> >
> > Based on that, my question is whether this is a very strange bug or
> > performance issue, or a problem with MPI somehow dying or running out of
> > resources?
> >
> > For reference, the running part of the submission script is-
> >
> > export SLURM_NODEFILE=`generate_pbs_nodefile`
> >
> > mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o 06A_prod1.out -p A.prmtop -c 05A_preprod2.rst -r 06A_prod1.rst -x 06A_prod1.nc -inf 06A_prod1.mdinfo -l 06A_prod1.log &
> >
> > mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 /share/apps/gpu/amber/pmemd.c uda.MPI -O -i 06B_prod.in -o 06B_prod1.out -p B.prmtop -c 06B_preprod2.rst -r 06B_prod1.rst -x 06B_prod1.nc -inf 06B_prod1.mdinfo -l 06B_prod1.log &
> >
> > wait
> >
> >
> > Best,
> >
> > Kenneth
> > --
> > Ask yourselves, all of you, what power would hell have if those imprisoned
> > here could not dream of heaven?
> >
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Ask yourselves, all of you, what power would hell have if those imprisoned
here could not dream of heaven?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Oct 22 2015 - 07:30:07 PDT