Re: [AMBER] CUDA MPI issues

From: Biplab Ghosh <ghosh.biplab.gmail.com>
Date: Thu, 26 May 2016 23:10:56 +0530

Dear Dr. Ross Walker,

Thank you very much for your elaborate reply.

I'm really sorry for replying to the entire digest! That was my first reply on
this forum!

I will compile Amber and OpenMPI with the GNU compilers and test again.
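
For reference, this is roughly how I plan to rebuild the GPU binaries - a
minimal sketch, assuming Amber 14 is installed under $AMBERHOME and that
OpenMPI's mpicc/mpif90 wrappers are already on my PATH:

  cd $AMBERHOME
  ./configure -cuda gnu        # serial GPU code (pmemd.cuda)
  make install
  ./configure -cuda -mpi gnu   # parallel GPU code (pmemd.cuda.MPI)
  make install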

I think it was a big mistake to buy this system. I was not knowledgeable
enough about this, so I relied entirely on the vendor.

Would you please elaborate a little on:

"I would suggest going back to them and asking them to reorganize the GPUs
for you such that they are on the same PCI-E domain"

Do you mean the motherboard has to be replaced, or can the same board be
reconfigured in some way?

With my regards,
Biplab.

On Thursday 26 May 2016, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Biplab
>
> If you are getting daily digests please do not reply to the entire digest
> - just include the messages relevant to your thread. Your messages get
> reflected to 3000+ people on the AMBER list - it helps not to unnecessarily
> fill everyone's inboxes or overly strain the server. Thanks.
>
> Thank you very much for your suggestion. The issues are as follows:
>
> 1. pmemd.cuda.MPI is not working in my system:
> ==================================
>
> I recompiled my Amber 14 using the following (earlier I had compiled with the
> GNU compilers):
> CUDA Version 7.5.18
> Intel parallel studio XE 2016 update2
> openmpi-1.10.2
>
> Compilation went smoothly, but some of the tests failed. I have attached the
> test logs and diffs (file: 2016-05-24_16-51-23.*). To test the individual
> GPUs, I ran the code from
>
> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>
> The results are also attached.
>
>
> Your GPU*.log files look fine. Note that in the small test case the output
> flips between two sets of numbers:
>
> 0.0: Etot = -58221.4314  EKtot = 14330.8203  EPtot = -72552.2517
> 0.1: Etot = -58221.4314  EKtot = 14330.8203  EPtot = -72552.2517
> 0.2: Etot = -58222.6006  EKtot = 14453.8398  EPtot = -72676.4405
> 0.3: Etot = -58221.4314  EKtot = 14330.8203  EPtot = -72552.2517
> 0.4: Etot = -58222.6006  EKtot = 14453.8398  EPtot = -72676.4405
> 0.5: Etot = -58222.6006  EKtot = 14453.8398  EPtot = -72676.4405
> 0.6: Etot = -58221.4314  EKtot = 14330.8203  EPtot = -72552.2517
> 0.7: Etot = -58222.6006  EKtot = 14453.8398  EPtot = -72676.4405
> 0.8: Etot = -58222.6006  EKtot = 14453.8398  EPtot = -72676.4405
> 0.9: Etot = -58221.4314  EKtot = 14330.8203  EPtot = -72552.2517
>
> This is a bug in the Intel compilers that I've never been able to track
> down. It seems to be benign but since the Intel compilers don't help with
> GPU performance I tend to just stick with the GNU compilers for the GPU
> code.
>
> The failures in the test cases are benign. These false negatives are
> something that has been much improved in AMBER 16.
>
> Then I tried to run the basic tutorial
> (http://ambermd.org/tutorials/basic/tutorial0/) using the command:
>
> mpirun -np 2 pmemd.cuda.MPI -O -i 01_Min.in -o 01_Min.out -p prmtop -c
> inpcrd -r 01_Min.rst -inf 01_Min.mdinfo
>
> The errors are attached in the file: pmemd.cuda.MPI.err
>
>
> Please see item 9 under supported features on the AMBER GPU webpage:
> http://ambermd.org/gpus/
>
> 9) imin=1 (in parallel) *Minimization is only supported in the serial GPU
> code.*
>
> Also read the warnings on there about using the GPU code for minimization.
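>
> As a rough sketch, using your tutorial file names (the heating-stage names
> below are only illustrative), run the minimization with the serial GPU
> binary and keep pmemd.cuda.MPI for the MD stages:
>
>   # minimization: imin=1 is only supported by the serial GPU code
>   pmemd.cuda -O -i 01_Min.in -o 01_Min.out -p prmtop -c inpcrd \
>       -r 01_Min.rst -inf 01_Min.mdinfo
>
>   # subsequent MD stages can then use the parallel GPU code
>   mpirun -np 2 pmemd.cuda.MPI -O -i 02_Heat.in -o 02_Heat.out -p prmtop \
>       -c 01_Min.rst -r 02_Heat.rst -x 02_Heat.nc -inf 02_Heat.mdinfo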
>
>
> My second problem is as follows:
>
> 2. p2p not working between the GPU 0,1
> ===========================
>
> The system was purchased from Supermicro, and details are available at the
> following URLs. The vendor says the board supports P2P.
>
>
> http://www.supermicro.com/products/system/4U/7048/SYS-7048A-T.cfm
> http://www.supermicro.com/products/motherboard/Xeon/C600/X10DAi.cfm
>
> However, when I run gpuP2PCheck, I get the following output:
>
> CUDA_VISIBLE_DEVICES="0,1"
> CUDA-capable device count: 2
> GPU0 "GeForce GTX TITAN X"
> GPU1 "GeForce GTX TITAN X"
>
> Two way peer access between:
> GPU0 and GPU1: NO
>
>
> Well, clearly they don't know how to properly build or test a GPU
> workstation for optimum performance. The board is dual socket - it thus has
> two PCI-E domains between which GPU peer-to-peer communication is not
> possible - see the following writeup I did for Exxact that should explain it:
>
>
> http://exxactcorp.com/blog/exploring-the-complexities-of-pcie-connectivity-and-peer-to-peer-communication/
>
> See the initial discussion under 'Traditional approach', which is what they
> sold you here: a GPU on one CPU socket and a GPU on the other CPU socket.
>
> I would suggest going back to them and asking them to reorganize the GPUs
> for you such that they are on the same PCI-E domain.
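>
> One quick way to check how the cards are wired - a sketch, assuming a
> reasonably recent NVIDIA driver - is the nvidia-smi topology matrix:
>
>   nvidia-smi topo -m
>
> If the GPU0/GPU1 entry shows the link crossing the socket interconnect
> (reported as SOC, or SYS on newer drivers), the cards sit on different
> PCI-E domains and peer-to-peer will not work; if they share a PCIe host
> bridge or switch (PHB/PXB/PIX), P2P should be possible.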
>
> All the best
> Ross
>
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Associate Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Associate Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>

-- 
Sent from my iPad
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 26 2016 - 11:00:03 PDT