Re: [AMBER] AMBER/sander parallel performance problem

From: Jyh-Shyong <jyhshyong0.gmail.com>
Date: Fri, 29 Jul 2011 08:57:06 +0800

Hi,

Thanks to everyone who replied to my question so quickly. I will try
again and see how much improvement I can get.

This is a dilemma for a cluster whose nodes have so many cores: to keep
up with the cross-node communication demand, multiple QDR InfiniBand
cards per node would be needed, but no motherboard on the market has
enough PCI-E 2.0 slots for them.

Under LSF management, all idle cores of a node get assigned to other
jobs, which adds another barrier to improving parallel scaling.
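
Perhaps I can work around that by reserving whole nodes for a job and
then launching fewer MPI ranks per node. A rough sketch, assuming our
LSF honors span[ptile]; the launch script name here is just a
placeholder:

    # ask LSF for 4 whole 48-core nodes so no other job shares them,
    # then start the MPI run with fewer ranks per node inside the script
    bsub -n 192 -R "span[ptile=48]" ./run_sander_parallel.sh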

Regards

Jyh-Shyong Ho, Ph.D.
Research Scientist
National Center for High Performance Computing
Hsinchu, Taiwan, ROC


On 2011/7/29 08:29 AM, Ross Walker wrote:
> Hi Jyh,
>
> Unfortunately what you are seeing makes perfect sense. Consider what you are
> describing here. You have 48 cores in a single node sharing, I assume, a
> single QDR infiniband connection. This means each core has 1/48th of the QDR
> bandwidth to itself. The problem is that people often try to save money by
> maximizing the number of cores in a node but do not scale the interconnect
> appropriately. With 48 cores in a single node you really want 4 or so QDR
> cards per node. QDR is typically used with nodes of at most 8 or 12 cores,
> in which case each individual core has 4 to 6 times the per-core IB
> bandwidth that you have in your system.
>
> So, at least for sander you are almost certainly locked to a single node
> unless you leave a bunch of cores idle on the node. E.g. try running 16
> threads with 8 per node (2 nodes), or 32 threads with 8 per node (4 nodes)
> and I bet you see better performance.
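>
> As a rough sketch of what such a launch could look like (the exact
> flags depend on your MPI stack; -npernode below is the Open MPI
> spelling, while mvapich2's mpirun_rsh normally takes a hostfile that
> lists each node the desired number of times; the input/output file
> names are placeholders):
>
>    # 16 MPI tasks spread as 8 per node over 2 nodes
>    mpirun -np 16 -npernode 8 $AMBERHOME/bin/sander.MPI \
>        -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt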
>
> A few additional things to note:
>
> 1) sander is unlikely to scale much beyond 64 cores and then only with
> decent interconnect between them. So 8 nodes by 8 cores per node will
> probably work here but 48 cores per node = no way.
>
> 2) sander uses a binary tree when the core count is a power of 2; otherwise
> it switches to a less efficient algorithm. The net result, as printed in the
> output file, is that you can expect better performance when the core count
> is a power of 2. Thus I suspect that 32 cores on 1 node (leaving 16 cores
> idle) will give you better performance than using 48. You may even find
> using 64 cores as 2 nodes by 32 cores each may show some speedup.
>
> 3) You are advised to use PMEMD if it supports the simulation you are
> running. This is better optimized for parallel and will probably scale much
> better. It also does not have the power of 2 core limitation that sander
> has. So try this. Note though that it too will still get choked by the fact
> the interconnect is not balanced for the number of cores in a node so you
> may need to leave cores idle in order to obtain best performance. E.g. try
> using just 16 or 32 cores per node.
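>
> A minimal sketch of the same idea with PMEMD (in Amber11 the parallel
> binary is pmemd.MPI and it accepts the same basic command-line flags as
> sander; file names are again placeholders):
>
>    # 64 MPI tasks as 16 per node over 4 nodes
>    mpirun -np 64 -npernode 16 $AMBERHOME/bin/pmemd.MPI \
>        -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt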
>
> In summary, use PMEMD if you can and unfortunately you cannot expect
> miracles if you build hopelessly unbalanced machines. You should also run
> some benchmarks for each of the simulations you wish to run since parallel
> performance will be VERY dependent on the simulation you are running. Larger
> simulations typically scale better. Finally consider leaving a bunch of
> cores idle on each of your machines and you might get better overall
> performance. E.g. 4 nodes by 32 cores per node (128 cores) will probably
> outperform 4 nodes by 48 cores per node (192 cores).
>
> Good luck,
> Ross
>
>> -----Original Message-----
>> From: Jyh-Shyong [mailto:jyhshyong0.gmail.com]
>> Sent: Thursday, July 28, 2011 5:00 PM
>> To: amber.ambermd.org
>> Subject: [AMBER] AMBER/sander parallel performance problem
>>
>> Dear Amber users,
>>
>> I just installed Amber11 on our new cluster computer, and ran some test
>> cases on it.
>> Each node has 48 cores, and all nodes are connected with QDR
>> infiniband.
>>
>> I built the parallel version of sander with both mvapich2-1.5 and
>> openmpi-1.4.3. The performance of the program is quite strange:
>>
>> Here is a case using 1 core; it took about 1 hr and 24 min:
>> | Job began at 10:00:53.602 on 07/27/2011
>> | Setup done at 10:00:54.844 on 07/27/2011
>> | Run done at 11:23:48.604 on 07/27/2011
>>
>> And the case using 48 cores (one computing node) took about 8 min
>> 40 s:
>>
>> | Job began at 20:32:36.572 on 07/28/2011
>> | Setup done at 20:32:38.436 on 07/28/2011
>> | Run done at 20:41:17.902 on 07/28/2011
>>
>> It is nice.
>>
>> However, the case using 192 cores (4 computing nodes) took about 11
>> min!
>> | Job began at 20:20:14.506 on 07/28/2011
>> | Setup done at 20:20:17.208 on 07/28/2011
>> | Run done at 20:31:26.587 on 07/28/2011
>>
>> Something is wrong when using more than one computing node. I
>> followed the installation guide and compiled the program with both the
>> Intel and GCC compilers and the MKL library, and I always got similar
>> results.
>>
>> Any hint on why this happens and how to fix the problem?
>>
>> Thanks.
>>
>> Jyh-Shyong Ho, Ph.D.
>> Research Scientist
>> National Center for High Performance Computing
>> Hsinchu, Taiwan, ROC
>>
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 28 2011 - 21:00:03 PDT