Re: [AMBER] Running amber v11 over multiple gpus/nodes

From: Baker D.J. <D.J.Baker.soton.ac.uk>
Date: Wed, 14 Sep 2011 16:18:03 +0100

Hi Peter,

Thank you. That really clears things up for me. The technology document is particularly good and sets out the CUDA v4 features really well. The parallel speed-up of this benchmark over 4 GPUs isn't that great (about 8 minutes to run the simulation vs 11.5 minutes on 2 GPUs); however, I suspect that is about as good as it gets at the moment. On the other hand, looking at the bigger picture, this is pretty good.

Here are some benchmarking figures for the Amber PME/Cellulose_production_NPT benchmark on our GPU hardware (wall-clock times in seconds):

# Benchmarking results
Conventional hardware, 8 CPUs -- 4881 s
Conventional parallel on 16 CPUs -- 2679 s

pmemd.cuda, serial (1 GPU) -- 961 s
pmemd.cuda.MPI, 2 GPUs -- 694 s
pmemd.cuda.MPI, 4 GPUs -- 524 s
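
For reference, the GPU speed-ups work out roughly as follows (a small Python sketch using just the timings above; nothing AMBER-specific is assumed, and the labels are only for printing):

# Speed-up and parallel efficiency from the wall-clock times above.
# Baseline for the GPU runs is the single-GPU (serial pmemd.cuda) time.
timings = {
    "pmemd.cuda, 1 GPU": (1, 961.0),
    "pmemd.cuda.MPI, 2 GPUs": (2, 694.0),
    "pmemd.cuda.MPI, 4 GPUs": (4, 524.0),
}

baseline = timings["pmemd.cuda, 1 GPU"][1]
for label, (ngpu, seconds) in timings.items():
    speedup = baseline / seconds        # relative to 1 GPU
    efficiency = speedup / ngpu         # fraction of ideal linear scaling
    print(f"{label}: {speedup:.2f}x speed-up, {efficiency:.0%} efficiency")

So 2 GPUs give roughly 1.4x over 1 GPU, and 4 GPUs roughly 1.8x, which is consistent with the modest scaling mentioned above.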

Best regards -- David.

-----Original Message-----
From: peter.stauffert.boehringer-ingelheim.com [mailto:peter.stauffert.boehringer-ingelheim.com]
Sent: Wednesday, September 14, 2011 12:37 PM
To: amber.ambermd.org
Subject: Re: [AMBER] Running amber v11 over multiple gpus/nodes

Hi David

CUDA_NIC_INTEROP=1 is part of NVIDIA's GPUDirect support. Have a look at http://developer.nvidia.com/gpudirect; at the bottom of that page there is a link to a GPUDirect Technology Overview presentation (http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf), which explains how GPUDirect works.

Best regards

Peter

Dr. Peter Stauffert
Boehringer Ingelheim Pharma GmbH & Co. KG

-----Original Message-----
From: Baker D.J. [mailto:D.J.Baker.soton.ac.uk]
Sent: Wednesday, 14 September 2011 13:14
To: amber.ambermd.org
Subject: [AMBER] Running amber v11 over multiple gpus/nodes

Hello,

I'm working on building Amber v11 with the latest set of bug fixes. I'm primarily interested in the CUDA performance patch provided by bugfix 17. I can now run Amber simulations over multiple GPUs/nodes -- that is, over 4 GPUs on 2 nodes (I have 2 GPUs installed in each compute node). Until this morning this simulation was crashing with a segmentation fault.

The key to getting the simulation going was to set CUDA_NIC_INTEROP=1 in the job. Could someone please help me get my head around this? I found this solution via a web search, and I think it's something to do with my NICs not supporting GPUDirect v2, but I'm not completely sure that I really understand the situation. My build environment was CUDA v4.0, the Intel compilers, Amber v11 (bugfixes 1-17) and AmberTools 1.5. I've tried building the parallel (CUDA) executable with both mvapich2-1.6 and openmpi-1.4.3. Oddly enough I find that, with one of the Amber benchmarks, I get similar performance with both these MPIs, which surprises me. I can complete the PME/Cellulose_production_NPT benchmark in 8 minutes on 4 GPUs. I'm using the generic OFED v1.5.* stack to run the IB network, by the way.
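
For concreteness, the job does something along these lines (sketched here in Python rather than the actual batch script; the hostfile name, input/output file names and launcher flags are placeholders, not copied from my real job):

# Export CUDA_NIC_INTEROP=1 before launching the multi-GPU pmemd run.
import os
import subprocess

env = os.environ.copy()
env["CUDA_NIC_INTEROP"] = "1"   # without this the 2-node run segfaulted

# 4 MPI ranks = 4 GPUs spread over 2 nodes (2 GPUs per node); hosts.txt is hypothetical.
cmd = [
    "mpirun", "-np", "4", "-hostfile", "hosts.txt",
    "pmemd.cuda.MPI",
    "-O", "-i", "mdin", "-p", "prmtop", "-c", "inpcrd", "-o", "mdout",
]
subprocess.run(cmd, env=env, check=True)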

Best regards -- David.




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 14 2011 - 08:30:05 PDT