Re: [AMBER] AMBER/sander parallel performance problem

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 28 Jul 2011 17:29:10 -0700

Hi Jyh,

Unfortunately what you are seeing makes perfect sense. Consider what you are
describing here. You have 48 cores in a single node sharing, I assume, a
single QDR InfiniBand connection. This means each core has only 1/48th of the
QDR bandwidth to itself. The problem is that people often try to save money
by maximizing the number of cores in a node but do not scale the interconnect
appropriately. With 48 cores in a single node you really want 4 or so QDR
cards per node. QDR is typically used with nodes of at most 8 or 12 cores, in
which case each individual core has 4 to 6 times the IB bandwidth per core
that you have in your system.
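
As a rough back-of-envelope illustration (my numbers, assuming a single QDR
HCA delivers roughly 32 Gbit/s of usable data rate, i.e. about 4 GB/s):

   4 GB/s / 48 cores  =  ~85 MB/s per core
   4 GB/s /  8 cores  =  ~500 MB/s per core

so each core in an 8-core QDR node sees roughly six times the interconnect
bandwidth of a core in one of your 48-core nodes.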

So, at least for sander, you are almost certainly locked to a single node
unless you leave a bunch of cores idle on each node. E.g. try running 16 MPI
tasks with 8 per node (2 nodes), or 32 tasks with 8 per node (4 nodes), and I
bet you see better performance.
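
To make that concrete, here is a sketch of the launch line (this assumes Open
MPI's mpirun and its -npernode flag; mvapich2 spells this differently, e.g.
via a hostfile that lists each node 8 times, and the input/output file names
below are just placeholders):

   # 16 MPI tasks spread as 8 per node across 2 nodes, leaving the rest idle
   mpirun -np 16 -npernode 8 $AMBERHOME/bin/sander.MPI \
       -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt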

A few additional things to note:

1) sander is unlikely to scale much beyond 64 cores, and even then only with
a decent interconnect between the nodes. So 8 nodes by 8 cores per node will
probably work here, but 48 cores per node = no way.

2) sander uses a binary tree when the core count is a power of 2; otherwise
it switches to a less efficient algorithm. The net result, as printed in the
output file, is that you can expect better performance when the core count is
a power of 2. Thus I suspect that 32 cores on 1 node (leaving 16 cores idle)
will give you better performance than using all 48. You may even find that 64
cores, run as 2 nodes by 32 cores each, shows some speedup.

3) You are advised to use PMEMD if it supports the simulation you are
running. It is better optimized for parallel execution and will probably
scale much better, and it does not have the power-of-2 core-count limitation
that sander has, so try it. Note, though, that it too will still be choked by
the fact that the interconnect is not balanced for the number of cores in a
node, so you may need to leave cores idle in order to obtain the best
performance, e.g. try using just 16 or 32 cores per node.
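
For example (again only a sketch: the -npernode flag is the Open MPI
spelling and the file names are placeholders):

   # 32 MPI tasks as 2 nodes x 16 cores, rather than filling all 48 cores
   mpirun -np 32 -npernode 16 $AMBERHOME/bin/pmemd.MPI \
       -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd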

In summary, use PMEMD if you can, and unfortunately you cannot expect
miracles if you build hopelessly unbalanced machines. You should also run
some benchmarks for each of the simulations you wish to run, since parallel
performance is VERY dependent on the simulation being run; larger simulations
typically scale better. Finally, consider leaving a bunch of cores idle on
each of your machines and you may get better overall performance: 4 nodes by
32 cores per node (128 cores) will probably outperform 4 nodes by 48 cores
per node (192 cores).
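
If it helps, a crude benchmark sweep is just a shell loop over core counts,
something along these lines (a sketch only: it assumes Open MPI's -npernode,
placeholder file names, and core counts adjusted to the node allocations you
can actually get; the grep pulls out the timestamp lines you quoted from your
sander output, and pmemd's timing section may be formatted slightly
differently):

   for np in 8 16 32 64; do
       # 8 MPI tasks per node, so 64 tasks means 8 nodes
       mpirun -np $np -npernode 8 $AMBERHOME/bin/pmemd.MPI \
           -O -i mdin -o mdout.$np -p prmtop -c inpcrd -r restrt.$np
       grep -E "Job began|Run done" mdout.$np
   done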

Good luck,
Ross

> -----Original Message-----
> From: Jyh-Shyong [mailto:jyhshyong0.gmail.com]
> Sent: Thursday, July 28, 2011 5:00 PM
> To: amber.ambermd.org
> Subject: [AMBER] AMBER/sander parallel performance problem
>
> Dear Amber users,
>
> I just installed Amber11 on our new cluster computer, and ran some test
> cases on it.
> Each node has 48 cores, and all nodes are connected with QDR InfiniBand.
>
> I built the parallel version of sander with both mvapich2-1.5 and
> openmpi-1.4.3. The performance of the program is quite strange:
>
> Here is a case using 1 core; it took about 1 hr 24 min:
> | Job began at 10:00:53.602 on 07/27/2011
> | Setup done at 10:00:54.844 on 07/27/2011
> | Run done at 11:23:48.604 on 07/27/2011
>
> And for the case using 48 cores (one computing node), it took about
> 9 min 40 s:
>
> | Job began at 20:32:36.572 on 07/28/2011
> | Setup done at 20:32:38.436 on 07/28/2011
> | Run done at 20:41:17.902 on 07/28/2011
>
> It is nice.
>
> However, for the case using 192 cores (4 computing nodes), it took about
> 12 min!
> | Job began at 20:20:14.506 on 07/28/2011
> | Setup done at 20:20:17.208 on 07/28/2011
> | Run done at 20:31:26.587 on 07/28/2011
>
> Something is wrong when using more than one computing node. I followed
> the installation guide and compiled the program using both the Intel and
> GCC compilers with the MKL library, and I always got similar results.
>
> Any hint on how this could happen and how to fix the problem?
>
> Thanks.
>
> Jyh-Shyong Ho, Ph.D.
> Research Scientist
> National Center for High Performance Computing
> Hsinchu, Taiwan, ROC
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 28 2011 - 17:30:07 PDT