Re: [AMBER] Amber16 Parallel CUDA Tests

From: Steven Ford <sford123.ibbr.umd.edu>
Date: Mon, 25 Jul 2016 15:06:20 -0400

Ross,

This is CentOS version 6.7 with kernel version 2.6.32-573.22.1.el6.x86_64.

The output of nvidia-smi is:

+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                  Off |
| N/A   34C    P0    59W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                  Off |
| N/A   27C    P0    48W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The version of nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

I used the GNU compilers, version 4.4.7.

I am using OpenMPI version 1.8.1-5.el6 from the CentOS repository. I have
not tried any other MPI installation.

Output of mpif90 --showme:

gfortran -I/usr/include/openmpi-x86_64 -pthread -I/usr/lib64/openmpi/lib
-Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags
-L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi


I set DO_PARALLEL to "mpirun -np 2".
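
For reference, the parallel GPU build and test sequence is essentially the one
you listed, i.e. something like this (the AMBERHOME path below is just a
placeholder for our install location):

export AMBERHOME=/path/to/amber16   # placeholder for the actual install path
export DO_PARALLEL="mpirun -np 2"   # two MPI ranks, one per GPU of the K80
cd $AMBERHOME
./configure -cuda -mpi gnu
make -j8 install
make test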

The parallel tests for the CPU were all successful.

I had not run 'make clean' between the build steps. I tried the tests again
this morning after running 'make clean' and got the same result.

I applied all patches this morning before testing again. I am using
AmberTools 16.10 and Amber 16.04.
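
In case it is useful, the patches were applied with the update_amber script
that ships with AmberTools, along these lines (flags from memory, so treat the
exact options as approximate):

$AMBERHOME/update_amber --version   # report the installed AmberTools/Amber patch level
$AMBERHOME/update_amber --update    # download and apply any outstanding patches

That is where the 16.10 / 16.04 numbers above come from.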


Thanks,

Steve

On Sat, Jul 23, 2016 at 6:32 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Steven,
>
> This is a large number of very worrying failures. Something is definitely
> very wrong here and I'd like to investigate further. Can you give me some
> more details about your system, please? This includes:
>
> The specifics of what version of Linux you are using.
>
> The output of nvidia-smi
>
> nvcc -V (might be lower case v to get version info).
>
> Did you use the GNU compilers or the Intel compilers and in either case
> which version?
>
> OpenMPI - can you confirm the version again and also send me the output of
> mpif90 --showme (it might be --show or -show or something similar) -
> essentially I want to see what the underlying compilation line is.
>
> Can you confirm what you had $DO_PARALLEL set to when you ran make test
> for the parallel GPU build. Also can you confirm if the regular (CPU)
> parallel build passed the tests please?
>
> Also did you run 'make clean' before each build step? E.g.
>
> ./configure -cuda gnu
> make -j8 install
> make test
> *make clean*
>
> ./configure -cuda -mpi gnu
> make -j8 install
> make test
>
> Have you tried any other MPI installations? - E.g. MPICH?
>
> And finally can you please confirm which version of Amber (and AmberTools)
> this is and which patches have been applied?
>
> Thanks.
>
> All the best
> Ross
>
> On Jul 21, 2016, at 14:20, Steven Ford <sford123.ibbr.umd.edu> wrote:
>
> Ross,
>
> Attached are the log and diff files. Thank you for taking a look.
>
> Regards,
>
> Steve
>
> On Thu, Jul 21, 2016 at 5:34 AM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
>> Hi Steve,
>>
>> Indeed, that is too big a difference to be just rounding error. If those
>> tests use the Langevin or Andersen thermostat, that would explain it
>> (different random number streams), although those tests are supposed to be
>> skipped in parallel.
>>
>> Can you send me a copy directly of your .log and .dif files for the 2 GPU
>> run and I'll take a closer look at it.
>>
>> All the best
>> Ross
>>
>> > On Jul 20, 2016, at 21:19, Steven Ford <sford123.ibbr.umd.edu> wrote:
>> >
>> > Hello All,
>> >
>> > I am currently trying to get Amber16 installed and running on our
>> > computing cluster. Our researchers are primarily interested in running
>> > the GPU accelerated programs. For GPU computing jobs, we have one CentOS
>> > 6.7 node with a Tesla K80.
>> >
>> > I was able to build Amber16 and run the Serial/Parallel CPU plus the
>> > Serial GPU tests with all file comparisons passing. However, only 5
>> > parallel GPU tests succeeded, while the other 100 comparisons failed.
>> >
>> > Examining the diff file shows that some of the numbers are not off by
>> > much, as the documentation said could happen. For example:
>> >
>> > 66c66
>> > < NSTEP = 1  TIME(PS) = 50.002  TEMP(K) = 351.27  PRESS = 0.
>> > > NSTEP = 1  TIME(PS) = 50.002  TEMP(K) = 353.29  PRESS = 0.
>> >
>> > This may also be too large to attribute to a rounding error, but it is a
>> > small difference compared to others:
>> >
>> > 85c85
>> > < Etot = -217.1552  EKtot = 238.6655  EPtot = -455.8207
>> > > Etot = -1014.2562  EKtot = 244.6242  EPtot = -1258.8804
>> >
>> > This was built with CUDA 7.5 and OpenMPI 1.8, and run with
>> > DO_PARALLEL="mpirun -np 2".
>> >
>> > Any idea what else could be affecting the output?
>> >
>> > Thanks,
>> >
>> > Steve
>> >
>> > --
>> > Steven Ford
>> > IT Infrastructure Specialist
>> > Institute for Bioscience and Biotechnology Research
>> > University of Maryland
>> > (240)314-6405
>>
>
>
>
> --
> Steven Ford
> IT Infrastructure Specialist
> Institute for Bioscience and Biotechnology Research
> University of Maryland
> (240)314-6405
> <2016-07-20_11-17-52.diff><2016-07-20_11-17-52.log>
>
>
>


-- 
Steven Ford
IT Infrastructure Specialist
Institute for Bioscience and Biotechnology Research
University of Maryland
(240)314-6405
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber