Re: [AMBER] Running Amber 11 simulations using pmemd.cuda.MPI

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 6 Jul 2011 10:34:36 -0700

OpenMPI's issues with pmemd.cuda.MPI are not a GPU issue - it simply can't
handle large messages without locking up, and that's for them to fix...


On Wed, Jul 6, 2011 at 8:45 AM, Baker D.J. <D.J.Baker.soton.ac.uk> wrote:

> Hi Ross,
>
> Thank you for your advice. I've spent the day working on MVAPICH2 and
> Amber. Rebuilding pmemd.cuda.MPI using MVAPICH2 is exactly what's needed.
> Using 4 GPUs (that is, two compute nodes) on the PME/Cellulose_production_NPT
> benchmark example I can get the simulation done in 15 minutes. This is
> excellent scaling, since the same simulation takes 30 minutes using 2 GPUs.
>
> I'll need to open this testing up to some of the Amber users here before we
> can call it a success. It's a pity that OpenMPI doesn't do the job -- I'm not
> keen to have to offer another flavor of MPI-2 on the cluster. Taking a look
> at the latest version of OpenMPI, off hand, it appears they are nowhere close
> to supporting GPUs properly.
>
> Best regards -- David.
>
> -----Original Message-----
> From: Ross Walker [mailto:ross.rosswalker.co.uk]
> Sent: Tuesday, July 05, 2011 4:58 PM
> To: 'AMBER Mailing List'
> Subject: Re: [AMBER] Running Amber 11 simulations using pmemd.cuda.MPI
>
> Hi David,
>
> > We recently installed Amber 11 on our RHELS computational cluster. I
> > built Amber 11 for both CPUs and GPUs. We have 15 compute nodes, each
> > with 2 Fermi GPUs installed. All these GPU nodes have QDR Mellanox
> > Infiniband cards installed. One of the users and I can successfully
> > run Amber simulations using pmemd.cuda.MPI over 2 GPUs (that is
> > locally on one of the compute nodes) - the speed up isn't bad. On the
> > other hand I've so far failed to run a simulation using multiple nodes
> > (let's say over 4 GPUs). In this case, the calculation appears to
> > hang, and I see very little output - apart from the GPUs being
> > detected and general set up, etc, etc. I've been working with a couple
> > of the Amber PME benchmarks.
>
> Have you tested the CPU code across multiple nodes? I assume this scales
> fine? You should check that just to make sure. In particular, make sure
> things are being routed correctly over the IB interface and not over TCP/IP,
> for example. Also make sure you aren't sharing the IB interface with NFS
> traffic or IP traffic.
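>
> As a rough way to check the routing with your current OpenMPI build (the MCA
> parameter names below are from OpenMPI 1.3; the process count, nodefile name
> and input file names are just placeholders), run the same CPU benchmark twice
> and compare the timings:
>
>     # force the run onto the InfiniBand (openib) transport only
>     mpirun -np 16 -hostfile nodefile --mca btl openib,sm,self \
>         $AMBERHOME/bin/pmemd.MPI -O -i mdin -p prmtop -c inpcrd -o mdout.ib
>
>     # force TCP; if this is not clearly slower, the "IB" runs were
>     # probably going over IP in the first place
>     mpirun -np 16 -hostfile nodefile --mca btl tcp,sm,self \
>         $AMBERHOME/bin/pmemd.MPI -O -i mdin -p prmtop -c inpcrd -o mdout.tcp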
>
> > Could anyone please advise us? I've already noted that we have a
> > fairly top-notch IB network - the QLogic switch and Mellanox cards are
> > all QDR. I built pmemd.cuda.MPI with the Intel compilers, CUDA 3.1,
> > and OpenMPI 1.3.3. Could it be that I should employ another flavor of
> > MPI or that OpenMPI needs to be configured in a particular way?
>
> 1) Use CUDA 3.2, it fixes a LOT. Also make sure you are using AMBER with
> all of the latest bugfixes applied. Check http://ambermd.org/
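>
> (Roughly, and assuming the combined bugfix.all file has already been
> downloaded from the AMBER 11 bugfix page into $AMBERHOME, applying the fixes
> looks something like:
>
>     cd $AMBERHOME
>     patch -p0 -N < bugfix.all   # -N skips fixes that are already applied
>     # then rebuild pmemd.cuda.MPI so the fixes actually take effect
>
> Double-check the bugfix page itself for the exact procedure.)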
>
> 2) I highly advise AGAINST using OpenMPI. Its performance is pretty
> terrible. I suggest using MVAPICH2. We use MVAPICH2-1.5, which is what the
> benchmarks on the page http://ambermd.org/gpus/ were done with. This was 2
> GPUs per node and 1 QDR IB card per node. Check that your IB card is in an
> x16 slot (along with both GPUs), otherwise you won't be getting the maximum
> performance out of the IB card. You should also enable GPU Direct in the
> MVAPICH2 setup, which the Mellanox cards should support. I am not entirely
> sure how to enable this in the MVAPICH2 setup though, as I have never had to
> build the cluster software stack myself. You can check with Mellanox
> directly; they should have a white paper explaining how to do this.
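>
> One rough way to check the slot width each card has actually negotiated (may
> need root; this just prints the Mellanox/NVIDIA device lines plus every
> device's link-capability/link-status lines, so match them up by eye):
>
>     lspci -vv | grep -iE "mellanox|nvidia|lnkcap|lnksta"
>
> Compare the negotiated width on each LnkSta line against the LnkCap line for
> the same device; a card that is capable of a wider link than it negotiated is
> probably sitting in the wrong slot.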
>
> Also start with just 1 GPU per node (use export CUDA_VISIBLE_DEVICES on
> each node and make sure your NODEFILE is set up to hand processes out to
> each node in turn) and see if you can scale.
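>
> As a sketch for a two-node, 1-GPU-per-node test (hostnames and file names
> here are just examples; the exact launcher and hostfile flag depend on your
> MPI, e.g. mpirun_rsh for MVAPICH2 or mpirun for OpenMPI):
>
>     # round-robin NODEFILE: one entry per node, so rank 0 lands on the
>     # first node and rank 1 on the second
>     cat > nodefile <<EOF
>     gpu-node01
>     gpu-node02
>     EOF
>
>     # expose only the first GPU on each node; this needs to be visible to
>     # every MPI process, e.g. set in the job script on each node or passed
>     # through the launcher
>     export CUDA_VISIBLE_DEVICES=0
>
>     mpirun -np 2 -hostfile nodefile \
>         $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout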
>
> Having said that, before you do the above make sure you are running
> correctly across the nodes. That is, for 4 GPUs you should do mpirun -np 4,
> and make sure the first 2 MPI tasks get given to node 0 and the next 2 to
> node 1. If you have 8-core nodes and use a default NODEFILE, it will end up
> putting all 4 tasks on the first node, so you end up running 4 GPU tasks on
> 2 GPUs and performance is thus utterly destroyed. So check this carefully
> before you do all of the above. I do recommend MVAPICH2 and GPU Direct
> though.
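>
> For the 4-GPU case on two dual-GPU nodes, the layout I mean looks roughly
> like this (hostnames are just examples, and the hostfile flag again depends
> on the launcher):
>
>     # listing each host twice hands ranks 0,1 to the first node and
>     # ranks 2,3 to the second, rather than packing all four onto node 0
>     cat > nodefile <<EOF
>     gpu-node01
>     gpu-node01
>     gpu-node02
>     gpu-node02
>     EOF
>
>     mpirun -np 4 -hostfile nodefile \
>         $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -p prmtop -c inpcrd -o mdout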
>
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 06 2011 - 11:00:03 PDT