Re: [AMBER] PMEMD Performance issue on AMD cluster

From: Jason Swails <jason.swails.gmail.com>
Date: Wed, 21 Aug 2013 08:20:14 -0400

On Wed, Aug 21, 2013 at 1:29 AM, Sangeetha B <sangeetha.bicpu.edu.in> wrote:

> Hi to all,
>
> We are using a 2-node cluster with AMD Opteron 6278 processors (16 cores,
> 2.4 GHz, 16 MB L3 cache) to run AMBER 11. Each node has 4 processors (64
> cores per node, 128 in total) and 64 GB of RAM. The parallel environment
> was set up by installing MVAPICH2.
>
> While running PMEMD simulations of a 150k-atom system on 64 cores, the
> performance starts at 1.4 ns/day. After 2-3 ns it drops to 1.0 ns/day, and
> to about 0.8 ns/day in longer runs.


A possible explanation is that your system started out at a low density and
is gradually becoming denser as it equilibrates (assuming you are running
NPT, of course). If that is the case, the cost of the direct space sum grows
as the system gets denser (especially if you are using a cutoff larger than
8 Angstroms, although without your input file I can only speculate). Since
the direct space sum is the most expensive part of most PME calculations of
this size (or smaller), that alone can explain the performance dip. You can
check this by restarting the simulation and seeing whether the performance
starts out around where the last simulation ended.
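
If it helps, a quick-and-dirty way to check is to pull the density values
out of your mdout file and see whether they are still climbing. This is just
a sketch -- the file name "mdout" is a placeholder, and it assumes an NPT
run, since constant-volume output prints no Density lines:

    #!/usr/bin/env python
    # Scan an Amber mdout file and report how the density evolved.
    import re
    import sys

    fname = sys.argv[1] if len(sys.argv) > 1 else 'mdout'  # placeholder name
    density_re = re.compile(r'Density\s*=\s*([0-9.]+)')

    densities = []
    with open(fname) as f:
        for line in f:
            m = density_re.search(line)
            if m:
                densities.append(float(m.group(1)))

    if densities:
        print('first %.4f  last %.4f  (%d samples)' %
              (densities[0], densities[-1], len(densities)))
    else:
        print('No Density lines found -- is this really an NPT run?')

If the last value is noticeably higher than the first, the box is still
compacting and the slowdown is simply the direct-sum cost catching up with
you.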

It may also have nothing to do with Amber; some other process on your system
may be consuming too many resources. Or it could be some obscure issue with
the pmemd load balancer. I couldn't tell you.


> The
> performance drops more rapidly when additional jobs are run simultaneously
> on the remaining cores.
>
> We are new to parallel computing environments. Is this to be expected? If
> not, why is the performance affected?
>

Yes, I fully expect performance to drop when you start running additional
jobs, especially other scientific programs. Modern scientific software is
typically memory-bound. What this means is that the memory bus is slow
compared to the processor, so the speed of the program is determined more by
memory bandwidth (i.e., how quickly the processor can be fed data to work on
from RAM) and cache misses than by the actual clock speed of your processor.
Therefore, the more processes you have running, the more that memory
bandwidth must be 'split' to keep all of the cores fed with the data they
need, which means each individual process gets less and less memory
bandwidth dedicated to its use. So even though you have enough processing
power to run 64 jobs on each node, there will be stiff competition for the
rest of the resources on the node.
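
If you want to see the effect for yourself, a toy numpy script like the one
below (nothing to do with pmemd, just an illustration) times a large array
copy, which is limited almost entirely by memory bandwidth. Run one instance
and note the number, then launch several instances at once: the per-process
figure drops, which is exactly what happens to your simultaneous pmemd jobs.

    #!/usr/bin/env python
    # Crude memory-bandwidth probe: stream ~400 MB through the memory bus
    # repeatedly and report the effective copy bandwidth.
    import time
    import numpy as np

    N = 50 * 1000 * 1000            # ~400 MB per array, far bigger than cache
    a = np.random.rand(N)
    b = np.empty_like(a)

    t0 = time.time()
    for _ in range(10):
        np.copyto(b, a)             # streaming read of a + write of b
    elapsed = time.time() - t0

    bytes_moved = 10 * 2 * a.nbytes # each pass reads a and writes b
    print('effective bandwidth: %.1f GB/s' % (bytes_moved / elapsed / 1e9))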

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 21 2013 - 05:30:03 PDT