Hi Ross,
Mahidhar actually went ahead and recompiled Amber under GNU instead of
Intel, and all of the tests I've done with it seem to be working fine now.
That said, I am consistently seeing the message below in the job log files
for the runs. The outputs from Amber don't seem to have any issues, and
there's nothing so far to suggest anything is actually wrong, but I wanted
to check whether it's an actual issue:
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
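
For what it's worth, I've only been checking which job logs emit it with a
quick grep (the log name pattern below just matches what my SBATCH output
line produces):

grep -l "floating-point exceptions are signalling" comet.*.out

My guess is that it's just gfortran's end-of-run summary of IEEE flags rather
than anything from Amber itself, and I think newer gfortran versions can
silence it at compile time with something like -ffpe-summary=none, but I
haven't tried that.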
Best,
Kenneth
On Fri, Oct 23, 2015 at 12:37 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Kenneth,
>
> A few things to try.
>
> 1) Right after the modules are loaded add: nvidia-smi -pm 1 (this will
> force loading of the nvidia driver)
>
> 2) Are you certain that the CUDA_VISIBLE_DEVICES you are specifying in the
> mpi command is getting propagated? What do your two mdout files report
> for the value of CUDA_VISIBLE_DEVICES? If this isn't getting propagated
> that would explain it.
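>
> For example, something along these lines should show what each run actually
> saw (the exact wording in the mdout GPU device info section may differ a
> little, and the file names are just your test outputs):
>
> grep -i "CUDA_VISIBLE_DEVICES" testB1.out testB2.out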
>
> 3) Ditch all this fancy mpi crap: MV2_USE_CUDA=1
> MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2
>
> I have no idea what these options are doing and they are probably just
> breaking things. I'd also ditch all the fancy module loads:
>
> > module load amber/14
>
> > module load intel/2015.2.164
> > module load cuda/6.5
> > module load mvapich2_gdr
>
> Load plain old vanilla GCC and Gfortran and vanilla mpich3 or mpich2 - and
> compile your own copy of amber 14 with:
>
> ./configure -cuda gnu
> make
> make clean
> ./configure -cuda -mpi gnu
> make
> make clean
>
> And you'll probably find all your problems go away. Amber GPU was written
> deliberately not to need ANY fancy compilers or mpi libraries, or fancy
> interconnects or fancy GPU direct options etc. They all just make things
> fragile. I don't know why the SDSC folk compiled the GPU code for the
> modules listed above in the first place.
>
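> When debugging this it can also help to try the simplest possible case first:
> a single-GPU serial run with no MPI at all. A rough sketch, using the input
> file names from your own script (the output names here are just placeholders):
>
> export CUDA_VISIBLE_DEVICES=0
> $AMBERHOME/bin/pmemd.cuda -O -i 06B_prod.in -o test_serial.out -p B.prmtop \
>     -c 06B_preprod2.rst -r test_serial.rst
>
> If that runs cleanly on each GPU in turn, the problem is more likely in the
> MPI layer than in the GPU code or your input.
>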
> Ultimately you want it nice and simple. I don't have a login on Comet so I
> can't give you the exact options but something like
>
> module load gnu/4.4.7
> module load mpich3
> module load cuda/6.5
>
> copy in your own amber 14 and AmberTools15 tar files. Untar them in ~/ and
>
> export AMBERHOME=~/amber14
> cd $AMBERHOME
> ./update_amber --update
> ./configure -cuda gnu
> make -j8 install
> make clean
> ./configure -cuda -mpi gnu
> make -j8 install
> make clean
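>
> As a quick sanity check that both GPU binaries actually got built (paths
> assume the AMBERHOME above):
>
> ls -l $AMBERHOME/bin/pmemd.cuda $AMBERHOME/bin/pmemd.cuda.MPI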
>
>
> Then have your runscript be real simple like:
>
> #!/bin/bash
> #SBATCH --job-name="testB1"
> #SBATCH --output="comet.%j.%N.out"
> #SBATCH --partition=gpu
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=4
> #SBATCH --no-requeue
> #SBATCH --gres=gpu:4
> #SBATCH --export=ALL
> #SBATCH -t 00:10:00
> #SBATCH -A TG-TRA130030
> #SBATCH --mail-type=begin
> #SBATCH --mail-type=end
>
> module load gnu/4.4.7
> module load mpich3
> module load cuda/6.5
>
> hostname
> nvidia-smi    # add -pm 1 if you can - it may need root, in which case leave it out
>
> export AMBERHOME=~/amber14
>
> cd job1
> export CUDA_VISIBLE_DEVICES=0,1
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... >& job1.log &
>
> cd ../job2
> export CUDA_VISIBLE_DEVICES=2,3
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... >& job2.log &
>
> wait
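>
> Submitting and checking it is then just (the script name here is whatever
> you save the above as):
>
> sbatch run_gpu_pairs.sh
> squeue -u $USER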
>
> Hope that helps.
>
> All the best
> Ross
>
>
> > On Oct 22, 2015, at 20:39, Kenneth Huang <kennethneltharion.gmail.com> wrote:
> >
> > Hi Ross,
> >
> > Right, I suppose functionally identical would be a better description.
> >
> >> First thing first is to run jobs with 'IDENTICAL' input on both sets of
> >> GPUs. If you see it fail on one set but not the other then it means it is a
> >> machine configuration issue / bios / etc and I can escalate it to SDSC
> >> support.
> >>
> >
> > That's what happens when it fails - I haven't been able to see it fail on
> > both jobs yet, which might just be a matter of running more tests. Whichever
> > job is first in the script seems to be the one that fails when the error
> > pops up.
> >
> > With identical inputs for both GPUs using 06B, the first job in the script
> > failed while the second was able to run. But on resubmitting the exact same
> > job, both jobs ran without any issues. Doing the same thing for 06A showed
> > no issues on two tries, even though that job was the one originally
> > failing in the past (possibly because it was first).
> >
> > If I try with different inputs, i.e. swapping 06B onto GPUs 0,1 and 06A onto
> > 2,3, then it's the one on top (06B) that fails. Likewise, the bottom one
> > that previously kept hitting the MPI error (06A) no longer has any issues.
> >
> > I actually opened a ticket with SDSC support through the XSEDE help desk
> > earlier this week about this and about some bizarre performance drops on one
> > of the GPU nodes, but we couldn't figure out whether this problem was a bug
> > or a resource issue, so I figured I'd check here and see.
> >
> >
> >> If it fails on both (or runs fine on both) then it says it is something
> >> with your job and we can attempt to find if there is a bug in the GPU code
> >> or something weird about your input. To do this though I need input that
> >> fails on any combination of 2 GPUs.
> >
> >
> > The part I can't get my head around is that the error doesn't seem to be
> > consistent. The 06B job mentioned above used the script below and failed
> > on the first job, but worked fine on both when I resubmitted it without
> > changing anything.
> >
> > #!/bin/bash
> > #SBATCH --job-name="testB1"
> > #SBATCH --output="comet.%j.%N.out"
> > #SBATCH --partition=gpu
> > #SBATCH --nodes=1
> > #SBATCH --ntasks-per-node=24
> > #SBATCH --no-requeue
> > #SBATCH --gres=gpu:4
> > #SBATCH --export=ALL
> > #SBATCH -t 00:10:00
> > #SBATCH -A TG-TRA130030
> > #SBATCH --mail-type=begin
> > #SBATCH --mail-type=end
> >
> > module load amber/14
> > module load intel/2015.2.164
> > module load cuda/6.5
> > module load mvapich2_gdr
> >
> > export SLURM_NODEFILE=`generate_pbs_nodefile`
> > mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
> >     MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
> >     /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB1.out \
> >     -p B.prmtop -c 06B_preprod2.rst -r testB1.rst -x testB1.nc \
> >     -inf testB1.mdinfo -l testB1.log &
> >
> > mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
> >     MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
> >     /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB2.out \
> >     -p B.prmtop -c 06B_preprod2.rst -r testB2.rst -x testB2.nc \
> >     -inf testB2.mdinfo -l testB2.log &
> >
> > wait
> >
> >
> >
> > Best,
> >
> > Kenneth
> >
> > On Thu, Oct 22, 2015 at 12:16 PM, Ross Walker <rosscwalker.gmail.com> wrote:
> >
> >> Hi Kenneth,
> >>
> >>> Yes, both the inputs and systems themselves are almost identical - 06B has a
> >>> ligand that 06A doesn't have, so the only difference in the inputs is the
> >>> nmr restraint file that they refer to.
> >>>
> >>
> >> So they are not the same. There is no such thing as 'almost' identical.
> >> Same as there is no such thing as 'almost' unique. The terms identical and
> >> unique are absolute adjectives. They can be true or false but nothing in
> >> between. The same is true of the word 'perfect' - although I note that even
> >> the US constitution gets this wrong with the phrase "..in order to form a
> >> more perfect union..."
> >>
> >> First thing first is to run jobs with 'IDENTICAL' input on both sets of
> >> GPUs. If you see it fail on one set but not the other then it means it is a
> >> machine configuration issue / bios / etc and I can escalate it to SDSC
> >> support.
> >>
> >> If it fails on both (or runs fine on both) then it says it is something
> >> with your job and we can attempt to find if there is a bug in the GPU code
> >> or something weird about your input. To do this though I need input that
> >> fails on any combination of 2 GPUs.
> >>
> >> All the best
> >> Ross
> >>
> >> /\
> >> \/
> >> |\oss Walker
> >>
> >> ---------------------------------------------------------
> >> | Associate Research Professor |
> >> | San Diego Supercomputer Center |
> >> | Adjunct Associate Professor |
> >> | Dept. of Chemistry and Biochemistry |
> >> | University of California San Diego |
> >> | NVIDIA Fellow |
> >> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> >> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >> ---------------------------------------------------------
> >>
> >> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> >> be read every day, and should not be used for urgent or sensitive issues.
> >>
> >>
> >
> >
> >
> > --
> > Ask yourselves, all of you, what power would hell have if those imprisoned
> > here could not dream of heaven?
>
>
>
--
Ask yourselves, all of you, what power would hell have if those imprisoned
here could not dream of heaven?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber