Ross,
My guess is the compiler side, since the serial GPU run prints a slightly
different message when it completes. Oddly enough, the message only shows
up in the job log file when the job finishes, and never at any point before:
Note: The following floating-point exceptions are signalling: IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
So should I just run the examples and benchmarks to see if anything looks weird?
Nothing in my outputs or trajectories looks off, but I'm not even sure what
I'd be looking for.
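
For what it's worth, the note looks like the summary that gfortran's runtime
prints at program exit whenever any IEEE flags were raised during the run
(the -ffpe-summary behaviour, if I remember right), rather than an error from
Amber itself. A throwaway sketch that reproduces the same note, assuming a
reasonably recent gfortran on the path (the file and program names are just
placeholders):

cat > fpe_demo.f90 <<'EOF'
program fpe_demo
  real :: x, y, z
  x = tiny(1.0)   ! smallest normal single-precision value
  y = x / 3.0     ! tiny, inexact result -> underflow flag, denormal value
  z = y * 2.0     ! denormal operand -> denormal flag
  print *, z
end program fpe_demo
EOF
gfortran -O0 fpe_demo.f90 -o fpe_demo
./fpe_demo
# at exit the runtime prints something like:
# Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

So it reads more like bookkeeping from the runtime than a problem in the run itself.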
Best,
Kenneth
On Tue, Oct 27, 2015 at 1:01 PM, Ross Walker <rosscwalker.gmail.com> wrote:
> Hi Kenneth,
>
> No idea on the Underflow error - I've not seen it before - do you know
> where that signal is coming from? From the compiler side of things or from
> the MPI side of things? - as in does it do it for serial GPU runs as well
> as multi-GPU runs?
>
> If the test cases are all good I wouldn't worry about it.
>
> All the best
> Ross
>
> > On Oct 27, 2015, at 9:56 AM, Kenneth Huang <kennethneltharion.gmail.com> wrote:
> >
> > Hi Ross,
> >
> > Mahidhar actually went ahead and recompiled Amber under GNU instead of
> > Intel, and all of the tests I've done with it seem to be working fine now.
> >
> > That said, I am consistently seeing a message in the job log files for the
> > runs - the outputs from Amber don't seem to have any issues, and there's
> > nothing so far to suggest anything is actually wrong, but I wanted to check
> > whether it's an actual issue:
> >
> > Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
> > Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
> >
> > Best,
> >
> > Kenneth
> >
> > On Fri, Oct 23, 2015 at 12:37 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Kenneth,
> >>
> >> A few things to try.
> >>
> >> 1) Right after the modules are loaded add: nvidia-smi -pm 1 (this will
> >> force loading of the nvidia driver)
> >>
> >> 2) Are you certain that the CUDA_VISIBLE_DEVICES you are specifying in the
> >> mpi command is getting propagated? What do your two mdout files report for
> >> the value of CUDA_VISIBLE_DEVICES? (See the quick grep check below.) If
> >> this isn't getting propagated, that would explain it.
> >>
> >> 3) Ditch all this fancy mpi crap: MV2_USE_CUDA=1
> >> MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2
> >>
> >> I have no idea what these options are doing and they are probably just
> >> breaking things. I'd also ditch all the fancy module loads:
> >>
> >>> module load amber/14
> >>
> >>> module load intel/2015.2.164
> >>> module load cuda/6.5
> >>> module load mvapich2_gdr
> >>
> >> Load plain old vanilla GCC and Gfortran and vanilla mpich3 or mpich2, and
> >> compile your own copy of Amber 14 with:
> >>
> >> ./configure -cuda gnu
> >> make install
> >> make clean
> >> ./configure -cuda -mpi gnu
> >> make install
> >> make clean
> >>
> >> And you'll probably find all your problems go away. Amber GPU was written
> >> deliberately not to need ANY fancy compilers or mpi libraries, or fancy
> >> interconnects or fancy GPU direct options etc. They all just make things
> >> fragile. I don't know why the SDSC folk compiled the GPU code against the
> >> modules listed above in the first place.
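> >>
> >> On point 2, a quick check once the jobs have started is to grep the two
> >> mdout files (the file names here are just placeholders for whatever you
> >> pass to -o):
> >>
> >> grep CUDA_VISIBLE_DEVICES run1.out run2.out
> >>
> >> Each mdout should list only the pair of devices you exported for that run.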
> >>
> >> Ultimately you want it nice and simple. I don't have a login on Comet so I
> >> can't give you the exact options but something like:
> >>
> >> module load gnu/4.4.7
> >> module load mpich3
> >> module load cuda/6.5
> >>
> >> Copy in your own Amber 14 and AmberTools 15 tar files, untar them in ~/ and:
> >>
> >> export AMBERHOME=~/amber14
> >> ./update_amber --update
> >> ./configure -cuda gnu
> >> make -j8 install
> >> make clean
> >> ./configure -cuda -mpi gnu
> >> make -j8 install
> >> make clean
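> >>
> >> A quick sanity check after the build (paths assume the layout above):
> >>
> >> ls -l $AMBERHOME/bin/pmemd.cuda $AMBERHOME/bin/pmemd.cuda.MPI   # both binaries exist
> >> ldd $AMBERHOME/bin/pmemd.cuda | grep -i cuda                    # linked against the cuda/6.5 libs you loaded
> >> which mpirun && mpirun --version                                # the vanilla MPI is the one on your PATH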
> >>
> >>
> >> Then have your runscript be real simple like:
> >>
> >> #!/bin/bash
> >> #SBATCH --job-name="testB1"
> >> #SBATCH --output="comet.%j.%N.out"
> >> #SBATCH --partition=gpu
> >> #SBATCH --nodes=1
> >> #SBATCH --ntasks-per-node=4
> >> #SBATCH --no-requeue
> >> #SBATCH --gres=gpu:4
> >> #SBATCH --export=ALL
> >> #SBATCH -t 00:10:00
> >> #SBATCH -A TG-TRA130030
> >> #SBATCH --mail-type=begin
> >> #SBATCH --mail-type=end
> >>
> >> module load gnu/4.4.7
> >> module load mpich3
> >> module load cuda/6.5
> >>
> >> hostname
> >> nvidia-smi   # ideally "nvidia-smi -pm 1", but that may need root, in which case leave -pm out
> >>
> >> export AMBERHOME=~/amber14
> >>
> >> cd job1
> >> export CUDA_VISIBLE_DEVICES=0,1
> >> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... >& job1.log &
> >>
> >> cd ../job2
> >> export CUDA_VISIBLE_DEVICES=2,3
> >> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... >& job2.log &
> >>
> >> wait
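> >>
> >> To submit it and check the GPU split while it runs (the script name is just
> >> an example, and whether you can ssh to the allocated node depends on site
> >> policy):
> >>
> >> sbatch simple_gpu_pairs.sh
> >> squeue -u $USER          # note the gpu node the job lands on
> >> ssh <node> nvidia-smi    # all four GPUs should show a pmemd.cuda.MPI process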
> >>
> >> Hope that helps.
> >>
> >> All the best
> >> Ross
> >>
> >>
> >>> On Oct 22, 2015, at 20:39, Kenneth Huang <kennethneltharion.gmail.com> wrote:
> >>>
> >>> Hi Ross,
> >>>
> >>> Right, I suppose functionally identical would be a better description.
> >>>
> >>>> First thing first is to run jobs with 'IDENTICAL' input on both sets of
> >>>> GPUs. If you see it fail on one set but not the other then it means it is a
> >>>> machine configuration issue / bios / etc and I can escalate it to SDSC
> >>>> support.
> >>>>
> >>>
> >>> That's what happens when it fails - I haven't been able to see it fail on
> >>> both jobs yet, which might just mean I need to run more tests. Whatever is
> >>> the first job in the script seems to be the one to fail when the error pops
> >>> up.
> >>>
> >>> With identical inputs for both GPU pairs using 06B, the first job in the
> >>> script failed while the second was able to run. But resubmitting the exact
> >>> same job, both jobs ran without any issues. Doing the same thing for 06A
> >>> didn't show any issues on two tries, even though that job was the one
> >>> originally failing in the past (possibly because it was first).
> >>>
> >>> If I try with different inputs, i.e. swapping 06B onto GPUs 0,1 and 06A onto
> >>> 2,3, then it's the one on top (06B) that fails. Likewise, the bottom one
> >>> that previously kept hitting the MPI error (06A) no longer has any issues.
> >>>
> >>> I actually opened a ticket with SDSC support through the XSEDE help desk
> >>> earlier this week about this and some bizarre performance drops on one of
> >>> the GPU nodes, but we couldn't figure out whether this problem was a bug or
> >>> a resource issue, so I figured I'd check here and see.
> >>>
> >>>
> >>>> If it fails on both (or runs fine on both) then it says it is something
> >>>> with your job and we can attempt to find if there is a bug in the GPU code
> >>>> or something weird about your input. To do this though I need input that
> >>>> fails on any combination of 2 GPUs.
> >>>
> >>>
> >>> The part I can't get my head around is that the error doesn't seem to be
> >>> consistent. The 06B job mentioned above used the script below and failed on
> >>> the first job, but worked fine on both when I resubmitted it without
> >>> changing anything.
> >>>
> >>> #!/bin/bash
> >>> #SBATCH --job-name="testB1"
> >>> #SBATCH --output="comet.%j.%N.out"
> >>> #SBATCH --partition=gpu
> >>> #SBATCH --nodes=1
> >>> #SBATCH --ntasks-per-node=24
> >>> #SBATCH --no-requeue
> >>> #SBATCH --gres=gpu:4
> >>> #SBATCH --export=ALL
> >>> #SBATCH -t 00:10:000
> >>> #SBATCH -A TG-TRA130030
> >>> #SBATCH --mail-type=begin
> >>> #SBATCH --mail-type=end
> >>>
> >>> module load amber/14
> >>> module load intel/2015.2.164
> >>> module load cuda/6.5
> >>> module load mvapich2_gdr
> >>>
> >>> export SLURM_NODEFILE=`generate_pbs_nodefile`
> >>> mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
> >>>   MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=0:2 CUDA_VISIBLE_DEVICES=0,1 \
> >>>   /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB1.out \
> >>>   -p B.prmtop -c 06B_preprod2.rst -r testB1.rst -x testB1.nc \
> >>>   -inf testB1.mdinfo -l testB1.log &
> >>>
> >>> mpirun_rsh -hostfile $SLURM_NODEFILE -np 2 MV2_USE_CUDA=1 \
> >>>   MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_CPU_MAPPING=1:3 CUDA_VISIBLE_DEVICES=2,3 \
> >>>   /share/apps/gpu/amber/pmemd.cuda.MPI -O -i 06B_prod.in -o testB2.out \
> >>>   -p B.prmtop -c 06B_preprod2.rst -r testB2.rst -x testB2.nc \
> >>>   -inf testB2.mdinfo -l testB2.log &
> >>>
> >>> wait
> >>>
> >>>
> >>>
> >>> Best,
> >>>
> >>> Kenneth
> >>>
> >>> On Thu, Oct 22, 2015 at 12:16 PM, Ross Walker <rosscwalker.gmail.com> wrote:
> >>>
> >>>> Hi Kenneth,
> >>>>
> >>>>> Yes, both the inputs and systems themselves are almost identical - 06B has a
> >>>>> ligand that 06A doesn't have, so the only difference in the inputs is the
> >>>>> nmr restraint file that they refer to.
> >>>>>
> >>>>
> >>>> So they are not the same. There is no such thing as 'almost' identical.
> >>>> Same as there is no such thing as 'almost' unique. The terms identical and
> >>>> unique are absolute adjectives. They can be true or false but nothing in
> >>>> between. The same is true of the word 'perfect' - although I note that even
> >>>> the US constitution gets this wrong with the phrase "...in order to form a
> >>>> more perfect union..."
> >>>>
> >>>> First thing first is to run jobs with 'IDENTICAL' input on both sets of
> >>>> GPUs. If you see it fail on one set but not the other then it means it is a
> >>>> machine configuration issue / bios / etc and I can escalate it to SDSC
> >>>> support.
> >>>>
> >>>> If it fails on both (or runs fine on both) then it says it is something
> >>>> with your job and we can attempt to find if there is a bug in the GPU code
> >>>> or something weird about your input. To do this though I need input that
> >>>> fails on any combination of 2 GPUs.
> >>>>
> >>>> All the best
> >>>> Ross
> >>>>
> >>>> /\
> >>>> \/
> >>>> |\oss Walker
> >>>>
> >>>> ---------------------------------------------------------
> >>>> | Associate Research Professor |
> >>>> | San Diego Supercomputer Center |
> >>>> | Adjunct Associate Professor |
> >>>> | Dept. of Chemistry and Biochemistry |
> >>>> | University of California San Diego |
> >>>> | NVIDIA Fellow |
> >>>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> >>>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >>>> ---------------------------------------------------------
> >>>>
> >>>> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> >>>> be read every day, and should not be used for urgent or sensitive issues.
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Ask yourselves, all of you, what power would hell have if those imprisoned
> >>> here could not dream of heaven?
> >>
> >>
> >>
> >
> >
> >
> > --
> > Ask yourselves, all of you, what power would hell have if those imprisoned
> > here could not dream of heaven?
>
>
>
--
Ask yourselves, all of you, what power would hell have if those imprisoned
here could not dream of heaven?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Oct 28 2015 - 20:00:03 PDT