Re: [AMBER] Amber16 Parallel CUDA Tests

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sat, 23 Jul 2016 23:32:59 +0100

Hi Steven,

This is a large number of very worrying failures. Something is definitely very wrong here and I'd like to investigate further. Can you give me some more details about your system, please? This includes:

The specifics of what version of Linux you are using.

The output of nvidia-smi

The output of nvcc -V (capital V; the long form is --version).

Did you use the GNU compilers or the Intel compilers and in either case which version?

OpenMPI - can you confirm the version again and also send me the output of mpif90 --showme (it might be --show or -show or something similar, depending on the MPI)? Essentially I want to see what the underlying compilation line is.

Can you confirm what you had $DO_PARALLEL set to when you ran make test for the parallel GPU build? Also, can you confirm whether the regular (CPU) parallel build passed the tests, please?
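
If it is easier, something like the following would gather most of the above in one go. This is only a rough sketch: the report and collect_info names are made up for illustration, /etc/redhat-release assumes a CentOS/RHEL-style system, and --showme is the Open MPI spelling. Tools that are not in your PATH are just noted as missing. If you built with the Intel compilers, swap in ifort --version and icc --version.

```shell
#!/bin/sh
# Hypothetical helper: collect the requested system details in one report.
# A missing tool is noted in the output instead of aborting the script.

report() {
    # $1 = label, remaining arguments = command to run
    label=$1
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with a non-zero status)"
    else
        echo "($1 not found in PATH)"
    fi
    echo
}

collect_info() {
    report "kernel" uname -a
    # Assumption: CentOS/RHEL layout; other distributions keep this elsewhere.
    report "distribution" cat /etc/redhat-release
    report "GPU and driver" nvidia-smi
    report "CUDA compiler" nvcc -V
    report "GNU C compiler" gcc --version
    report "GNU Fortran compiler" gfortran --version
    report "MPI launcher" mpirun --version
    # Open MPI spelling; MPICH-style wrappers use -show instead.
    report "MPI Fortran wrapper" mpif90 --showme
    echo "== DO_PARALLEL =="
    echo "${DO_PARALLEL:-(not set)}"
}

collect_info
```

Redirecting the output to a file (for example by running the script as sh collect.sh > sysinfo.txt, where both file names are placeholders) gives you something you can attach to a reply.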

Also did you run 'make clean' before each build step? E.g.

./configure -cuda gnu
make -j8 install
make test
make clean

./configure -cuda -mpi gnu
make -j8 install
make test
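
To rule out stale objects, the sequence above could be wrapped in a small script like this. It is a sketch under assumptions: run, build_and_test and DRY_RUN are hypothetical names, -np 2 is the value from your original report, and it assumes you are in $AMBERHOME with the environment already set up. By default it only prints the commands; set DRY_RUN=0 to actually run them, stopping at the first failure.

```shell
#!/bin/sh
# Sketch only: repeat the serial GPU and parallel GPU builds with a
# 'make clean' in between. DRY_RUN defaults to 1 (print, do not execute).
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Echo the command, then execute it unless this is a dry run.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    fi
}

build_and_test() {
    # Arguments are the configure flags for one build, e.g. -cuda gnu
    run ./configure "$@"
    run make -j8 install
    run make test
}

# Serial GPU build. Assumption: DO_PARALLEL should not be set for serial tests.
unset DO_PARALLEL
build_and_test -cuda gnu

# Clear out objects from the serial build before reconfiguring.
run make clean

# Parallel GPU build; the launcher line is the one quoted in the original report.
DO_PARALLEL="mpirun -np 2"
export DO_PARALLEL
build_and_test -cuda -mpi gnu
```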

Have you tried any other MPI installations, e.g. MPICH?

And finally, can you please confirm which version of Amber (and AmberTools) this is and which patches have been applied?

Thanks.

All the best
Ross

> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>
> Ross,
>
> Attached are the log and diff files. Thank you for taking a look.
>
> Regards,
>
> Steve
>
> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Steve,
>
> Indeed that is too big a difference to just be rounding error. If those tests are using Langevin or Andersen for the thermostat, that would explain it (different random number streams), although those tests are supposed to be skipped in parallel.
>
> Can you send me a copy directly of your .log and .dif files for the 2 GPU run and I'll take a closer look at it.
>
> All the best
> Ross
>
> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu> wrote:
> >
> > Hello All,
> >
> > I am currently trying to get Amber16 installed and running on our computing
> > cluster. Our researchers are primarily interested in running the GPU
> > accelerated programs. For GPU computing jobs, we have one CentOS 6.7 node
> > with a Tesla K80.
> >
> > I was able to build Amber16 and run the Serial/Parallel CPU plus the Serial
> > GPU tests with all file comparisons passing. However, only 5 parallel GPU
> > tests succeeded, while the other 100 comparisons failed.
> >
> > Examining the diff file shows that some of the numbers are off by only a
> > small amount, as the documentation said could happen. For example:
> >
> > 66c66
> > < NSTEP = 1 TIME(PS) = 50.002 TEMP(K) = 351.27 PRESS = 0.
> > > NSTEP = 1 TIME(PS) = 50.002 TEMP(K) = 353.29 PRESS = 0.
> >
> > This may also be too large to attribute to a rounding error, but it is a
> > small difference compared to others:
> >
> > 85c85
> > < Etot = -217.1552 EKtot = 238.6655 EPtot = -455.8207
> > > Etot = -1014.2562 EKtot = 244.6242 EPtot = -1258.8804
> >
> > This was built with CUDA 7.5 and OpenMPI 1.8, and run with
> > DO_PARALLEL="mpirun -np 2"
> >
> > Any idea what else could be affecting the output?
> >
> > Thanks,
> >
> > Steve
> >
> > --
> > Steven Ford
> > IT Infrastructure Specialist
> > Institute for Bioscience and Biotechnology Research
> > University of Maryland
> > (240)314-6405
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
> --
> Steven Ford
> IT Infrastructure Specialist
> Institute for Bioscience and Biotechnology Research
> University of Maryland
> (240)314-6405
> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Jul 23 2016 - 17:30:03 PDT