Re: [AMBER] speed of amber12 from Ross Walker on 2013-07-14 (Amber Archive Jul 2013)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sun, 14 Jul 2013 17:21:22 -0700

Hi Jorgen,

Unfortunately, Myrinet 10G is a pretty sucky interconnect, so don't expect
miracles, especially if I/O traffic is also being sent over the same
interconnect.

First, though, I'd make sure you are using the correct MPI. I'd use the
Intel compilers for the CPU code if you have them available. I'd then
check which MPI you have. Ideally you want to be using MVAPICH2, and make
sure it is configured to use the actual Myrinet interconnect. From your
numbers it looks like you might be going over some default gigabit
network, or at the very least wrapping TCP/IP over the Myrinet
connection. I am assuming each node has 16 cores (REAL cores, not
hyperthreaded ones?), and thus you see a massive slowdown as soon as you
move to more than one node. This strongly suggests that the MPI is
misconfigured and is using TCP/IP.
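
To put a number on that slowdown, here is a small Python sketch (not part of Amber) that computes parallel efficiency, i.e. the measured speedup divided by the increase in core count, from the ns/day figures Jorgen quotes below. The helper name parallel_efficiency is hypothetical; 1.0 would mean perfect linear scaling.

```python
def parallel_efficiency(base_cores, base_ns_per_day, cores, ns_per_day):
    """Measured speedup divided by the ideal speedup (ratio of core counts).

    1.0 means perfect linear scaling; values well below 1.0 typically mean
    communication overhead dominates once the extra cores are added.
    """
    speedup = ns_per_day / base_ns_per_day
    ideal = cores / base_cores
    return speedup / ideal


# ns/day figures as reported in the quoted message.
runs = {
    "sander.MPI, cluster 1": ((16, 1.00), (32, 0.24)),
    "pmemd, cluster 1": ((16, 0.21), (32, 0.21)),
    "Cray": ((32, 1.23), (64, 1.42)),
}

for label, ((c0, r0), (c1, r1)) in runs.items():
    eff = parallel_efficiency(c0, r0, c1, r1)
    print(f"{label}: {c0} -> {c1} cores, efficiency = {eff:.2f}")
```

This prints efficiencies of 0.12, 0.50 and 0.58. For sander.MPI, going from one node to two makes the run roughly four times slower, which is the pattern you would expect if inter-node traffic is falling back to TCP/IP.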

Note that NPT with ntt=3 (and NO ig=-1) is the absolute worst case you can
choose for scaling. At the very least set ig=-1 so it doesn't have to
synchronize the random number stream on every step.
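
As a minimal sketch of that change, the Python below writes Jorgen's &cntrl settings back out as an Amber-style namelist with ig = -1 added (in Amber, ig = -1 takes the random seed from the current date and time). The helper write_cntrl is hypothetical; editing the input file by hand works just as well, and the point is only that ig is the single new line.

```python
def write_cntrl(title, settings):
    """Render settings as an Amber-style &cntrl namelist (hypothetical helper)."""
    lines = [title, " &cntrl"]
    for key, value in settings.items():
        lines.append(f"  {key} = {value},")
    lines.append(" /")
    return "\n".join(lines) + "\n"


# Settings from the quoted input file, with ig = -1 added.
settings = {
    "imin": 0, "irest": 1, "ntx": 7,
    "ntb": 2, "pres0": 1.0, "ntp": 1, "taup": 2.0,
    "cut": 10.0, "ntr": 0,
    "ntc": 2, "ntf": 2,
    "tempi": 300.0, "temp0": 300.0,
    "ntt": 3, "gamma_ln": 1.0,
    "ig": -1,  # seed from the clock, so the random stream need not be synchronized
    "nstlim": 5000000, "dt": 0.002,
    "ntpr": 5000, "ntwx": 5000, "ntwr": 5000,
}

mdin = write_cntrl("1ns MD", settings)
print(mdin)
```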

All the best
Ross

On 7/14/13 3:54 PM, "Jorgen Simonsen" <jorgen589.gmail.com> wrote:

>Hi all,
>
>I have just compiled amber12 with gcc and MPI (gcc-version) on one of our
>clusters, and on a Cray system with PGI compilers.
>
>To test the speed and scaling of Amber, I ran an NPT calculation (input
>file below) on a 64852-atom system composed of a ligand, ions, explicit
>water and a protein:
>
>1ns MD
> &cntrl
> imin = 0, irest = 1, ntx = 7,
> ntb = 2, pres0 = 1.0, ntp = 1,
> taup = 2.0,
> cut = 10.0, ntr = 0,
> ntc = 2, ntf = 2,
> tempi = 300.0, temp0 = 300.0,
> ntt = 3, gamma_ln = 1.0,
> nstlim = 5000000, dt = 0.002,
> ntpr = 5000, ntwx = 5000, ntwr = 5000
>/
>
>but the scaling I get is not very good, so I was wondering what kind of
>speeds to expect.
>
>Cluster 1 has the following specifications:
>
>A Myrinet 10G network connects all nodes in a topology appropriate for
>latency-sensitive parallel codes while also supporting I/O bandwidth for
>data-intensive workloads. Each compute rack supports a total of 56 nodes
>split among four IBM BladeCenter H chassis. Additional racks are reserved
>for storage, servers, and networking.
>
>I ran the simulation for 5000 steps before it was terminated and got the
>following numbers with sander.MPI:
>
>16 cpus: ns/day = 1.00
>32 cpus: ns/day = 0.24
>
>which suggests that I am doing something completely wrong. With pmemd:
>16 cpus: ns/day = 0.21
>32 cpus: ns/day = 0.21
>
>On the Cray I get the following:
>32 cpus: ns/day = 1.23
>64 cpus: ns/day = 1.42
>
>Any help improving the speed would be appreciated: Ross Walker's benchmark
>numbers are quite different and much better, so any improvement would be
>great.
>
>Thanks, and sorry for any missing information.
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber
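
For scale, a short Python sketch of the run-length arithmetic behind these ns/day figures: simulated time is nstlim × dt (Amber's dt is in picoseconds), and wall-clock time is simulated time divided by the ns/day rate. The function names are illustrative only. With the quoted input, 5,000,000 steps × 0.002 ps is 10 ns, and the 5000-step timing run covers 10 ps.

```python
def simulated_ns(nstlim, dt_ps):
    """Simulated time in ns for nstlim steps of dt_ps picoseconds each."""
    return nstlim * dt_ps / 1000.0


def wall_days(sim_ns, ns_per_day):
    """Wall-clock days needed to cover sim_ns at a given ns/day rate."""
    return sim_ns / ns_per_day


full_run = simulated_ns(5_000_000, 0.002)  # nstlim and dt from the input file
test_run = simulated_ns(5_000, 0.002)      # the 5000-step timing run

for label, rate in [("sander.MPI, 16 cores", 1.00),
                    ("pmemd, 16 or 32 cores", 0.21),
                    ("Cray, 64 cores", 1.42)]:
    days = wall_days(full_run, rate)
    minutes = wall_days(test_run, rate) * 24 * 60
    print(f"{label}: full run {days:.1f} days, 5000-step test {minutes:.0f} min")
```

At the 0.21 ns/day pmemd rate, the full 10 ns run would take roughly 48 days, versus about 10 days at 1.00 ns/day.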

Received on Sun Jul 14 2013 - 17:30:02 PDT