Re: [AMBER] CUDA MPI issues

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 26 May 2016 10:57:14 -0700

Hi Biplab,

No worries... So I took a look at the motherboard manual here:

http://www.supermicro.com/support/resources/results.cfm

If you look at page 10 of that manual it has a block diagram of how the motherboard is wired.

Note it shows two PCI-E x16 slots (slots 1 and 3) connected to the left CPU, and one PCI-E x16 slot plus two PCI-E x8 slots (slots 5, 4 and 2) connected to the right CPU. Right now you likely have one GPU in slot 5 and one in slot 1 or 3. That configuration will run two separate pmemd.cuda jobs fine, but a single job cannot span both GPUs (and still get good performance) unless they are on the same CPU. You would therefore need to remove the GPU from PCI-E slot 5 and plug it into whichever of slots 1 or 3 is free. Of course, this assumes there is physical room in the case to do so - and you might need to move any cards in the other slots (network etc.) if they are present. Looking at the motherboard picture on page 3:

It looks like there could be an issue with using slot 1 for a double-wide card, since it may foul the cabling connected to the pins on the edge of the motherboard. If there isn't room, that would explain why they used slots 3 and 5.

So, if you are comfortable moving GPUs around, you could try this yourself and see whether there is room. You may, however, want to check with the vendor first whether the change is supported and whether they can help carry it out.
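If you do move a card, a quick way to confirm that both GPUs ended up on the same CPU is to look at the driver's topology report and then re-run the peer-to-peer check. This is just a sketch - it assumes a reasonably recent NVIDIA driver (one that supports the 'topo' subcommand) and the gpuP2PCheck binary from the validation test you already ran:

  # Show how each GPU connects to the CPUs / PCI-E root complexes.
  # GPUs on the same socket report PIX/PXB/PHB rather than SOC.
  nvidia-smi topo -m

  # Re-run the peer-to-peer test; it should now report
  # "GPU0 and GPU1: YES" for two-way peer access.
  ./gpuP2PCheck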

Ultimately this was not a very good choice of motherboard for optimum GPU computing - I certainly wouldn't have picked this one. If you plan on getting more machines let me know and I'll make sure you get optimum designs.

All the best
Ross

> On May 26, 2016, at 10:40 AM, Biplab Ghosh <ghosh.biplab.gmail.com> wrote:
>
> Dear Dr. Ross Walker,
>
> Thank you very much for your elaborate reply.
>
> I'm really sorry for replying to the entire digest! That was my first reply on this forum!
>
> I will compile Amber and OpenMPI with the GNU compilers and test again.
>
> I think it was a big mistake to buy this system. I was not knowledgeable enough about this, so I relied entirely on the vendor.
>
> Would you please elaborate a little bit on:
>
> "I would suggest going back to them and asking them to reorganize the GPUs for you such that they are on the same PCI-E domain"
>
> Do you mean replacing the motherboard, or is there some way the same board can be used?
>
> With my regards,
> Biplab.
>
> On Thursday 26 May 2016, Ross Walker <ross.rosswalker.co.uk> wrote:
> Hi Biplab
>
> If you are getting daily digests, please do not reply to the entire digest - just include the messages relevant to your thread. Your messages get reflected to 3000+ people on the AMBER list, so it helps not to fill everyone's inboxes or put unnecessary strain on the server. Thanks.
>
>> Thank you very much for your suggestion. The issues are as follows:
>>
>> 1. pmemd.cuda.MPI is not working in my system:
>> ==================================
>>
>> I recompiled my Amber 14 using the following (earlier I had compiled with the GNU compilers):
>> CUDA Version 7.5.18
>> Intel parallel studio XE 2016 update2
>> openmpi-1.10.2
>>
>> Compilation went smoothly, but some of the tests failed. I have attached the
>> test logs and diffs (file: 2016-05-24_16-51-23.*). To test the individual GPUs, I
>> ran the code from
>>
>> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>>
>> The results are also attached.
>>
>
> Your GPU*.log files look fine. Note that in the small test case it flips between two sets of numbers:
>
> 0.0: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
> 0.1: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
> 0.2: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
> 0.3: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
> 0.4: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
> 0.5: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
> 0.6: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
> 0.7: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
> 0.8: Etot = -58222.6006 EKtot = 14453.8398 EPtot = -72676.4405
> 0.9: Etot = -58221.4314 EKtot = 14330.8203 EPtot = -72552.2517
>
> This is a bug in the Intel compilers that I've never been able to track down. It seems to be benign but since the Intel compilers don't help with GPU performance I tend to just stick with the GNU compilers for the GPU code.
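> If you do rebuild with the GNU toolchain, something along these lines should work - this is just a sketch, and it assumes AMBERHOME is set and that your OpenMPI wrappers (mpicc/mpif90) were built against gcc/gfortran:
>
>   cd $AMBERHOME
>   ./configure -cuda gnu && make install        # serial code, including pmemd.cuda
>   ./configure -cuda -mpi gnu && make install   # parallel GPU code (pmemd.cuda.MPI)
>
> Re-running the test suite after each build is worth the time.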
>
> The failures in the test cases are benign. These false negatives are something that has been much improved in AMBER 16.
>
>> Then I tried to run the basic tutorial (http://ambermd.org/tutorials/basic/tutorial0/) using the command:
>>
>> mpirun -np 2 pmemd.cuda.MPI -O -i 01_Min.in -o 01_Min.out -p prmtop -c inpcrd -r 01_Min.rst -inf 01_Min.mdinfo
>>
>> The errors are attached in the file: pmemd.cuda.MPI.err
>>
>
> Please see item 9 under supported features on the AMBER GPU webpage: http://ambermd.org/gpus/
>
> 9) imin=1 (in parallel): minimization is only supported in the serial GPU code.
>
> Also read the warnings on there about using the GPU code for minimization.
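> In practice that means doing the minimization with the serial GPU code (or with the CPU code, which is generally safer for minimization) and saving pmemd.cuda.MPI for the MD stages. A sketch, reusing the file names from your command above plus a hypothetical MD input 02_Heat.in:
>
>   # minimization: serial GPU code (or pmemd / pmemd.MPI on the CPU)
>   pmemd.cuda -O -i 01_Min.in -o 01_Min.out -p prmtop -c inpcrd -r 01_Min.rst -inf 01_Min.mdinfo
>
>   # later MD steps can then use pmemd.cuda.MPI across both GPUs
>   # (once the peer-to-peer issue below is sorted out)
>   mpirun -np 2 pmemd.cuda.MPI -O -i 02_Heat.in -o 02_Heat.out -p prmtop -c 01_Min.rst -r 02_Heat.rst -x 02_Heat.nc -inf 02_Heat.mdinfo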
>
>>
>> My second problem is as follows:
>>
>> 2. P2P not working between GPUs 0 and 1
>> ===================================
>>
>> The system was purchased from Supermicro and the details are available
>> at the following URLs. The vendor says the board supports P2P.
>>
>>
>> http://www.supermicro.com/products/system/4U/7048/SYS-7048A-T.cfm
>> http://www.supermicro.com/products/motherboard/Xeon/C600/X10DAi.cfm
>
>
>> However, when I run gpuP2PCheck, I get the following output:
>>
>> CUDA_VISIBLE_DEVICES="0,1"
>> CUDA-capable device count: 2
>> GPU0 "GeForce GTX TITAN X"
>> GPU1 "GeForce GTX TITAN X"
>>
>> Two way peer access between:
>> GPU0 and GPU1: NO
>
>
> Well, clearly they don't know how to properly build or test a GPU workstation for optimum performance. The board is dual socket - it thus has two PCI-E domains that cannot do peer-to-peer communication with each other. See the following write-up I did for Exxact, which should explain it:
>
> http://exxactcorp.com/blog/exploring-the-complexities-of-pcie-connectivity-and-peer-to-peer-communication/
>
> See the initial discussion under 'Traditional approach', which is what they sold you here: one GPU attached to one CPU socket and the other GPU attached to the other CPU socket.
>
> I would suggest going back to them and asking them to reorganize the GPUs for you such that they are on the same PCI-E domain.
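> In the meantime you can still keep both cards busy by running two independent single-GPU jobs, each pinned to its own device with CUDA_VISIBLE_DEVICES. A sketch - the input/output file names here are just placeholders:
>
>   # job pinned to GPU 0
>   CUDA_VISIBLE_DEVICES=0 pmemd.cuda -O -i md1.in -o md1.out -p prmtop -c start1.rst -r md1.rst -x md1.nc &
>   # job pinned to GPU 1
>   CUDA_VISIBLE_DEVICES=1 pmemd.cuda -O -i md2.in -o md2.out -p prmtop -c start2.rst -r md2.rst -x md2.nc &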
>
> All the best
> Ross
>
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Associate Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Associate Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.
>
>
>
> --
> Sent from my iPad

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 26 2016 - 11:00:03 PDT