Re: [AMBER] Running Amber 11 simulations using pmemd.cuda.MPI from Ross Walker on 2011-07-05 (Amber Archive Jul 2011)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 5 Jul 2011 08:57:44 -0700

Hi David,

> We recently installed Amber 11 on our RHELS computational cluster. I
> built Amber 11 for both CPUs and GPUs. We have 15 compute nodes each
> with 2 Fermi GPUs installed. All these GPU nodes have QDR Mellanox
> Infiniband cards installed. One of the users and I can successfully run
> Amber simulations using pmemd.cuda.MPI over 2 GPUs (that is locally on
> one of the compute nodes) - the speed up isn't bad. On the other hand
> I've so far failed to run a simulation using multiple nodes (let's say
> over 4 GPUs). In this case, the calculation appears to hang, and I see
> very little output - apart from the GPUs being detected and general set
> up, etc, etc. I've been working with a couple of the Amber PME
> benchmarks.

Have you tested the CPU code across multiple nodes? I assume it scales
fine, but you should check that just to make sure. In particular, make sure
traffic is being routed over the IB interface and not over TCP/IP, for
example. Also make sure you aren't sharing the IB interface with NFS or
other IP traffic.

> Could anyone please advise us. I've already noted that we have a fairly
> top notch IB network - the Qlogic switch and Mellanox cards are all
> QDR. I built pmemd.cuda.MPI with the Intel compilers, CUDA 3.1, and
> OpenMPI 1.3.3. Could it be that I should employ another flavor of MPI
> or that OpenMPI needs to be configured in a particular way?

1) Use CUDA 3.2; it fixes a LOT. Also make sure you are using AMBER with all
of the latest bugfixes applied. Check http://ambermd.org/

2) I highly advise AGAINST using OpenMPI. Its performance is pretty
terrible. I suggest using MVAPICH2. We use MVAPICH2-1.5, which is what the
benchmarks on the page http://ambermd.org/gpus/ were done with. That was 2
GPUs per node and 1 QDR IB card per node. Check that your IB card is in an
x16 slot (along with both GPUs), otherwise you won't be getting the maximum
performance out of the IB card. You should also enable GPU Direct in the
MVAPICH2 setup, which the Mellanox cards should support. I am not entirely
sure how to enable this in the MVAPICH2 setup, though, as I have never had to
build the cluster software stack myself. You can check with Mellanox
directly; they should have a white paper explaining how to do this.

Also start with just 1 GPU per node (export CUDA_VISIBLE_DEVICES on each
node and make sure your NODEFILE is set up to give processes out to each
node in turn) and see if you can scale.
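
As a rough illustration of the one-GPU-per-node test, here is a minimal
sketch. The host names node0..node3, the helper name one_per_node, and the
file name nodefile.1gpu are made up for the example, and the -machinefile
flag is an assumption: the hostfile option differs between MPI launchers, so
check the one you actually use.

```shell
#!/bin/sh
# Sketch only: node0..node3 are hypothetical host names.

# Print one line per node, so the launcher hands out one process to each
# node in turn before it reuses any node.
one_per_node() {
    for node in "$@"; do
        printf '%s\n' "$node"
    done
}

# Expose only the first GPU (device 0) to CUDA programs started from this
# shell. The same setting has to be in effect on every node, for example
# through a shell startup file there.
export CUDA_VISIBLE_DEVICES=0

# Show what the node list would look like.
one_per_node node0 node1 node2 node3

# On the cluster, write the list to a file and launch with it, e.g.:
#   one_per_node node0 node1 node2 node3 > nodefile.1gpu
#   mpirun -np 4 -machinefile nodefile.1gpu pmemd.cuda.MPI <usual arguments>
```

If this scales from 2 to 4 nodes, the interconnect and MPI setup are probably
fine, and any remaining trouble is more likely in how processes are placed
when both GPUs per node are in use.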

Having said that, before you do the above, make sure you are running
correctly across the nodes. That is, for 4 GPUs you should do mpirun -np 4.
Make sure the first 2 MPI processes get given to node 0 and the next 2 to
node 1. If you have 8-core nodes and use a default NODEFILE, it will end up
putting all 4 processes on the first node, so you end up running 4 GPU tasks
on 2 GPUs and performance is utterly destroyed. So check this carefully
before you do all the above. I do recommend MVAPICH2 and GPU Direct though.
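
One way to check the placement is to launch something trivial, such as
hostname, with the same settings and count how many processes land on each
node. The sketch below is only an illustration: node0 and node1 are
hypothetical host names, block_nodefile and ranks_per_host are helper names
invented here, and it assumes a launcher that fills processes in the order
the hosts are listed in the file.

```shell
#!/bin/sh
# Sketch only: node0 and node1 are hypothetical host names.

# Print each node name $1 times in a row, so that (with a launcher that
# follows the file order) the first $1 processes go to the first node,
# the next $1 to the second node, and so on.
block_nodefile() {
    per_node=$1
    shift
    for node in "$@"; do
        i=0
        while [ "$i" -lt "$per_node" ]; do
            printf '%s\n' "$node"
            i=$((i + 1))
        done
    done
}

# Read one host name per line (e.g. the output of running `hostname`
# under mpirun) and print "host count" for each distinct host.
ranks_per_host() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Node list for 4 processes on 2 nodes with 2 GPUs each.
block_nodefile 2 node0 node1

# Simulated output of a misplaced run: all four processes on one node.
printf 'node0\nnode0\nnode0\nnode0\n' | ranks_per_host
```

On the cluster, the equivalent check would be something like
"mpirun -np 4 -machinefile <nodefile> hostname | ranks_per_host" (the
hostfile flag again depends on the launcher). Each of the two nodes should
be reported with a count of 2; a single node with a count of 4 is the
misplacement described above.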

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

Received on Tue Jul 05 2011 - 09:00:08 PDT