Re: [AMBER] pmemd.cuda.MPI on Comet- MPI dying

From: Kenneth Huang <kennethneltharion.gmail.com>
Date: Thu, 22 Oct 2015 23:39:47 -0400

Hi Ross,

Right, I suppose functionally identical would be a better description.

First thing first is to run jobs with 'IDENTICAL' input on both sets of
> GPUs. If you see it fail on one set but not the other then it means it is a
> machine configuration issue / bios / etc and I can escalate it to SDSC
> support.
>

That's what happens when it fails: I haven't yet seen it fail on both jobs at
once, though that may just mean I need to run more tests. Whichever job is
launched first in the script seems to be the one that fails when the error
pops up.

With identical 06B inputs on both GPU pairs, the first job in the script
failed while the second ran. But when I resubmitted the exact same script,
both jobs ran without any issues. Doing the same thing with identical 06A
inputs showed no issues on two tries, even though 06A was the job that
originally failed in the past (possibly because it was launched first).

If I try with different inputs, i.e. swapping them so that 06B runs on GPUs
0,1 and 06A on GPUs 2,3, then it's the one launched first (06B) that fails.
Likewise, the one launched second, which previously kept hitting the MPI
error (06A), no longer has any issues.
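
For reference, the launch lines for that mixed test looked roughly like the
following; they simply replace the two mpirun_rsh blocks in the full 06B
script further down, with everything else (modules, nodefile setup, wait)
unchanged. The 06A input/topology/restart names and the test*_mix output
names here are placeholders rather than my exact file names:

# 06B on GPUs 0,1 (the pair launched first, which is the one that failed)
mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB_mix.out -p B.prmtop \
    -c 06B_preprod2.rst -r testB_mix.rst -x testB_mix.nc -inf testB_mix.mdinfo -l testB_mix.log &

# 06A on GPUs 2,3 (launched second; ran cleanly in this test)
mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06A_prod.in -o testA_mix.out -p A.prmtop \
    -c 06A_preprod2.rst -r testA_mix.rst -x testA_mix.nc -inf testA_mix.mdinfo -l testA_mix.log &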

I actually opened a ticket with SDSC support through the XSEDE help desk
earlier this week about this and about some bizarre performance drops on one
of the GPU nodes, but we couldn't determine whether this problem was a bug or
a resource issue, so I figured I'd check here as well.


If it fails on both (or runs fine on both) then it says it is something
> with your job and we can attempt to find if there is a bug in the GPU code
> or something weird about your input. To do this though I need input that
> fails on any combination of 2 GPUs.


That's the part I can't get my head around: the error doesn't seem to be
consistent. The 06B job mentioned above used the script below and failed on
the first job, but both ran fine when I resubmitted it without changing
anything.

#!/bin/bash
#SBATCH --job-name="testB1"
#SBATCH --output="comet.%j.%N.out"
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --no-requeue
#SBATCH --gres=gpu:4
#SBATCH --export=ALL
#SBATCH -t 00:10:00
#SBATCH -A TG-TRA130030
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

module load amber/14
module load intel/2015.2.164
module load cuda/6.5
module load mvapich2_gdr

export SLURM_NODEFILE=`generate_pbs_nodefile`
mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB1.out -p B.prmtop \
    -c 06B_preprod2.rst -r testB1.rst -x testB1.nc -inf testB1.mdinfo -l testB1.log &

mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
    MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
    /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB2.out -p B.prmtop \
    -c 06B_preprod2.rst -r testB2.rst -x testB2.nc -inf testB2.mdinfo -l testB2.log &

wait
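
One thing I may add before the mpirun_rsh lines next time, assuming
nvidia-smi and the usual SLURM environment variables are available on the
compute node, is a quick record of what the node actually exposes, so a
failure can be tied back to a specific device and host:

# Log the GPUs and hosts this job was given (output file names here are arbitrary)
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv > gpus.${SLURM_JOB_ID}.txt
cat $SLURM_NODEFILE > nodes.${SLURM_JOB_ID}.txt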



Best,

Kenneth

On Thu, Oct 22, 2015 at 12:16 PM, Ross Walker <rosscwalker.gmail.com> wrote:

> Hi Kenneth,
>
> > Yes, both the inputs and systems themselves are almost identical- 06B
> has a
> > ligand that 06A doesn't have, so the only difference in the inputs is the
> > nmr restraint file that they refer to.
> >
>
> So they are not the same. There is no such thing as 'almost' identical.
> Same as there is no such thing as 'almost' unique. The terms identical and
> unique are absolute adjectives. They can be true or false but nothing in
> between. The same is true of the word 'perfect' - although I note that even
> the US constitution gets this wrong with the phrase "..in order to form a
> more perfect union..."
>
> First thing first is to run jobs with 'IDENTICAL' input on both sets of
> GPUs. If you see it fail on one set but not the other then it means it is a
> machine configuration issue / bios / etc and I can escalate it to SDSC
> support.
>
> If it fails on both (or runs fine on both) then it says it is something
> with your job and we can attempt to find if there is a bug in the GPU code
> or something weird about your input. To do this though I need input that
> fails on any combination of 2 GPUs.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Associate Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Associate Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Ask yourselves, all of you, what power would hell have if those imprisoned
here could not dream of heaven?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Oct 22 2015 - 21:00:04 PDT