Re: [AMBER] CUDA MPI issues

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 26 May 2016 09:20:10 -0700

Hi Biplab

If you are getting daily digests, please do not reply to the entire digest - just quote the messages relevant to your thread. Your messages get reflected to 3000+ people on the AMBER list, so trimming the reply helps avoid filling everyone's inboxes and overly straining the server. Thanks.

> Thank you very much for your suggestion. The issues are as follows:
>
> 1. pmemd.cuda.MPI is not working in my system:
> ==================================
>
> I recompiled my amber14 (earlier I had compiled with the GNU compilers) using:
> CUDA Version 7.5.18
> Intel parallel studio XE 2016 update2
> openmpi-1.10.2
>
> Compilation went smoothly but some of the tests failed. I have attached the
> test logs and diffs (file: 2016-05-24_16-51-23.*). To test the individual GPUs, I
> ran the code
>
> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>
> The results are also attached.
>

Your GPU*.log files look fine. Note that in the small test case the result flips between two values:

0.0: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
0.1: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
0.2: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
0.3: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
0.4: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
0.5: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
0.6: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
0.7: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
0.8: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
0.9: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
 
This is a bug in the Intel compilers that I've never been able to track down. It appears to be benign, but since the Intel compilers don't help GPU performance I tend to just stick with the GNU compilers for the GPU code.
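
If you do want to rebuild the GPU code with the GNU compilers, something along these lines should work from $AMBERHOME (a rough sketch - assuming gcc/gfortran and your OpenMPI wrappers are already on your PATH, and that you let configure clean between builds):

  ./configure -cuda gnu && make install          # serial GPU code (pmemd.cuda)
  ./configure -cuda -mpi gnu && make install     # parallel GPU code (pmemd.cuda.MPI)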

The failures in the test cases are benign; such spurious failures are something that has been much improved in AMBER 16.

> Then I tried to run the basic tutorial (http://ambermd.org/tutorials/basic/tutorial0/) using the command:
>
> mpirun -np 2 pmemd.cuda.MPI -O -i 01_Min.in -o 01_Min.out -p prmtop -c inpcrd -r 01_Min.rst -inf 01_Min.mdinfo
>
> The errors are attached in the file: pmemd.cuda.MPI.err
>

Please see item 9 under supported features on the AMBER GPU webpage: http://ambermd.org/gpus/

9) imin=1 (in parallel) Minimization is only supported in the serial GPU code.

Also read the warnings on there about using the GPU code for minimization.
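
As a rough sketch of the usual workflow (reusing the file names from your command, plus a hypothetical 02_Heat.in for the next stage): do the minimization with the serial GPU code (or the CPU code), then switch to pmemd.cuda.MPI for the MD steps, e.g.

  pmemd.cuda -O -i 01_Min.in -o 01_Min.out -p prmtop -c inpcrd -r 01_Min.rst -inf 01_Min.mdinfo
  mpirun -np 2 pmemd.cuda.MPI -O -i 02_Heat.in -o 02_Heat.out -p prmtop -c 01_Min.rst -r 02_Heat.rst -x 02_Heat.nc -inf 02_Heat.mdinfo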

>
> My second problem is as follows:
>
> 2. p2p not working between the GPU 0,1
> ===========================
>
> The system was purchased from Supermicro and details are available
> at the following URLs. The vendor says the board supports p2p.
>
>
> http://www.supermicro.com/products/system/4U/7048/SYS-7048A-T.cfm
> http://www.supermicro.com/products/motherboard/Xeon/C600/X10DAi.cfm


> However, when I run gpuP2PCheck, I get the following output:
>
> CUDA_VISIBLE_DEVICES="0,1"
> CUDA-capable device count: 2
> GPU0 "GeForce GTX TITAN X"
> GPU1 "GeForce GTX TITAN X"
>
> Two way peer access between:
> GPU0 and GPU1: NO


Well, clearly they don't know how to properly build or test a GPU workstation for optimum performance. The board is dual socket, so it has two PCI-E domains, and peer-to-peer communication is not possible across them. See the following write-up I did for Exxact, which should explain it:

http://exxactcorp.com/blog/exploring-the-complexities-of-pcie-connectivity-and-peer-to-peer-communication/

See the initial discussion under 'Traditional approach', which is what they sold you here: one GPU attached to each CPU socket.

I would suggest going back to them and asking them to reorganize the GPUs for you such that they are on the same PCI-E domain.
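
Once they have moved the cards you can check the layout yourself - a quick sketch, assuming a reasonably recent NVIDIA driver:

  nvidia-smi topo -m    # GPUs behind the same socket show PIX/PHB rather than SOC
  ./gpuP2PCheck         # should now report 'GPU0 and GPU1: YES'
  export CUDA_VISIBLE_DEVICES=0,1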

All the best
Ross


/\
\/
|\oss Walker

---------------------------------------------------------
| Associate Research Professor |
| San Diego Supercomputer Center |
| Adjunct Associate Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 26 2016 - 09:30:03 PDT