Re: [AMBER] performance of pmemd.cuda.MPI vs pmemd.cuda both running on single GPU

From: Niel Henriksen <>
Date: Mon, 15 Oct 2012 16:39:12 +0000


This may be of only limited use to you. I have been investigating
a small RNA system using both regular GPU MD and GPU REMD.

System: Small RNA in TIP3P water, 7622 atoms

On a Kepler Quadro K5000 (conventional MD) I get 82 ns/day

On the Keeneland supercomputer, with Tesla M2090's
(specs: )
using 24 replicas (i.e., 24 GPUs), I get 85 ns/day.

I'm not comparing exactly the same GPUs, but probably close enough.

Your speed problems may be related to lack of IB and exchange frequency.
Bottom line: Good performance is possible. =)
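
For reference, a multi-GPU REMD run like this is typically driven by a groupfile with one line per replica. The sketch below assumes 24 temperature replicas; all file names are hypothetical:

```shell
# Hypothetical groupfile for a 24-replica T-REMD run, one line per replica
# (only the first two and last lines are shown; "..." stands for the rest):
cat > remd.groupfile <<'EOF'
-O -i mdin.000 -p prmtop -c inpcrd.000 -o mdout.000 -r restrt.000
-O -i mdin.001 -p prmtop -c inpcrd.001 -o mdout.001 -r restrt.001
...
-O -i mdin.023 -p prmtop -c inpcrd.023 -o mdout.023 -r restrt.023
EOF

# One MPI rank per replica, each rank driving its own GPU;
# -rem 1 selects temperature replica exchange:
mpirun -np 24 pmemd.cuda.MPI -ng 24 -groupfile remd.groupfile -rem 1
```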


From: []
Sent: Monday, October 15, 2012 1:22 AM
Subject: [AMBER] performance of pmemd.cuda.MPI vs pmemd.cuda both running on single GPU

Dear all,

We are using AMBER12 for calculations on GPUs. Most of our cards are GTX480s
and GTX580s that are part of clusters lacking IB interconnects. Therefore,
until recently we used just the serial version of pmemd.cuda for classical MD
simulations and were very happy with the very fast simulations.

After the release of AMBER12 with the Kepler patch (patch 9), I was
interested in testing REMD simulations on GPUs. I did not expect any heavy
I/O traffic from replica exchange, so I tried it on our clusters lacking IB
interconnects. I found that REMD with each replica running on a single GPU
was almost twice as slow as a standard single-GPU MD run (which is actually
still very good). However, when I wanted to explicitly check the effect of
communication in REMD and ran a classical simulation of the same system on a
single GPU, I found that this slow-down is due to switching from the serial
pmemd.cuda to the parallel pmemd.cuda.MPI, not due to switching from a
single-GPU run to REMD. By accident, I used the pmemd.cuda.MPI binary for a
classical simulation on a single GPU (I guess this was not possible in older
versions) and realized that it is as fast/slow as the REMD simulations, but
twice as slow compared to a simulation with the serial pmemd.cuda. Do you
have any idea what is going on? Do you have the same experience, or is there
some problem with our compilation and/or hardware?
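
To make the comparison concrete, here is a minimal sketch of the two invocations being compared; the input file names are hypothetical, and both runs are pinned to the same single GPU so that only the binary differs:

```shell
# Pin both runs to the same GPU so only the binary differs:
export CUDA_VISIBLE_DEVICES=0

# Serial GPU build -- the fast case:
pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md_serial.out

# Parallel GPU build, still a single MPI rank on the same GPU --
# the case that runs roughly twice as slow:
mpirun -np 1 pmemd.cuda.MPI -O -i md.in -p prmtop -c inpcrd -o md_mpi.out
```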

At the beginning I thought the problem was OpenMPI, but I tested it with
AMBER compiled with MPICH 1.5rc1 and obtained the same results. Then I
thought the problem might be a communication bottleneck through PCIe, as I
did my tests on an old cluster with PCIe 2.0 x16, but a few days ago I tested
it on our most recent cluster with PCIe 3.0, with the same result.

Could anyone please advise us? I have already tested OpenMPI 1.4.1 and
MPICH2 1.5rc1, both compiled with the Intel compiler (MKL ver.; the same
compiler was used for the AMBER compilation) and CUDA 4.2.

Thank you for any comments or suggestions.

Pavel Banas

Palacky University Olomouc, Czech Republic

AMBER mailing list
Received on Mon Oct 15 2012 - 10:00:03 PDT