Re: [AMBER] AMBER/sander parallel performance problem

From: Jyh-Shyong <jyhshyong0.gmail.com>
Date: Fri, 29 Jul 2011 10:15:03 +0800

Hi,

I did a quick test of pmemd.MPI on our cluster of 48-core nodes:

192 cores, using 6 cores per node (32 nodes):
chem.alps5:/work/chem/10ps-md-1> tail 10ps-192cpu-6.pmemd
| ns/day = 10.41 seconds/ns = 8302.72

| Master Setup CPU time: 0.68 seconds
| Master NonSetup CPU time: 82.96 seconds
| Master Total CPU time: 83.63 seconds 0.02 hours

| Master Setup wall time: 1 seconds
| Master NonSetup wall time: 83 seconds
| Master Total wall time: 84 seconds 0.02 hours
-----------------------------------------------------------------------------------------------
192 cores, using 12 cores per node (16 nodes):
chem.alps5:/work/chem/10ps-md-1> tail 10ps-192cpu-12.pmemd
| ns/day = 10.22 seconds/ns = 8456.85

| Master Setup CPU time: 0.66 seconds
| Master NonSetup CPU time: 84.49 seconds
| Master Total CPU time: 85.15 seconds 0.02 hours

| Master Setup wall time: 3 seconds
| Master NonSetup wall time: 84 seconds
| Master Total wall time: 87 seconds 0.02 hours
-----------------------------------------------------------------------------------------------------
192 cores, using 24 cores per node (8 nodes):
chem.alps5:/work/chem/10ps-md-1> tail 10ps-192cpu-24.pmemd
| ns/day = 8.77 seconds/ns = 9855.29

| Master Setup CPU time: 0.56 seconds
| Master NonSetup CPU time: 98.43 seconds
| Master Total CPU time: 98.99 seconds 0.03 hours

| Master Setup wall time: 1 seconds
| Master NonSetup wall time: 99 seconds
| Master Total wall time: 100 seconds 0.03 hours
-------------------------------------------------------------------------------------------
192 cores, using 48 cores per node (4 nodes):
chem.alps5:/work/chem/10ps-md-1> tail 10ps-192cpu-48.pmemd
| ns/day = 6.38 seconds/ns = 13536.52

| Master Setup CPU time: 0.78 seconds
| Master NonSetup CPU time: 135.18 seconds
| Master Total CPU time: 135.96 seconds 0.04 hours

| Master Setup wall time: 3 seconds
| Master NonSetup wall time: 136 seconds
| Master Total wall time: 139 seconds 0.04 hours

Apparently, using 1/4 or 1/8 of the cores on each node comes close to the
balance point between computation and communication. It is nice that LSF
provides a parameter (ptile) to control the number of cores used per node.
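
For example, a submission script along the following lines should place at
most 12 MPI ranks on each 48-core node; the #BSUB log file and the input
file names (md.in, prmtop, inpcrd) are only placeholders for illustration:

#!/bin/bash
#BSUB -n 192                      # total number of MPI ranks
#BSUB -R "span[ptile=12]"         # place at most 12 ranks on each node
#BSUB -o 10ps-192cpu-12.lsf.out

# launcher options differ between MPI stacks; this assumes an OpenMPI-style mpirun
mpirun -np 192 $AMBERHOME/bin/pmemd.MPI -O -i md.in -p prmtop -c inpcrd \
    -o 10ps-192cpu-12.pmemd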

Regards

Jyh-Shyong Ho, Ph.D.
Research Scientist
National Center for High Performance Computing
Hsinchu, Taiwan, ROC


On 2011/7/29 at 08:29 AM, Ross Walker wrote:
> Hi Jyh,
>
> Unfortunately what you are seeing makes perfect sense. Consider what you are
> describing here. You have 48 cores in a single node sharing, I assume, a
> single QDR infiniband connection. This means each core has 1/48th of the QDR
> bandwidth to itself. The problem is that people often try to save money by
> maximizing the number of cores in a node but do not scale the interconnect
> appropriately. With 48 cores in a single node you really want 4 or so QDR
> cards per node. QDR is typically used with nodes of at most 8 or 12 cores, in
> which case each individual core has 4 to 6 times the IB bandwidth that each
> core in your system has.
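>
> As a rough back-of-the-envelope check: a QDR 4x link signals at 40 Gbit/s,
> which after encoding overhead is about 32 Gbit/s, or roughly 4 GB/s, of
> usable data. Shared 48 ways that is on the order of 80 MB/s per core, while
> 8 or 12 cores sharing the same card would each see roughly 500 or 330 MB/s.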
>
> So, at least for sander, you are almost certainly locked to a single node
> unless you leave a bunch of cores idle on each node. E.g. try running 16
> threads at 8 per node (2 nodes), or 32 threads at 8 per node (4 nodes), and I
> bet you will see better performance.
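>
> With the OpenMPI build, for instance, something along these lines (input and
> output file names are just placeholders) spreads 32 threads over 4 nodes at
> 8 per node:
>
> mpirun -np 32 -npernode 8 $AMBERHOME/bin/sander.MPI -O -i md.in -p prmtop \
>     -c inpcrd -o mdout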
>
> A few additional things to note:
>
> 1) sander is unlikely to scale much beyond 64 cores and then only with
> decent interconnect between them. So 8 nodes by 8 cores per node will
> probably work here but 48 cores per node = no way.
>
> 2) sander uses a binary tree when the core count is a power of 2; otherwise
> it switches to a less efficient algorithm. The net result, as printed in the
> output file, is that you can expect better performance when the core count
> is a power of 2. Thus I suspect that 32 cores on 1 node (leaving 16 cores
> idle) will give you better performance than using 48. You may even find that
> using 64 cores, as 2 nodes by 32 cores each, shows some speedup.
>
> 3) You are advised to use PMEMD if it supports the simulation you are
> running. It is better optimized for parallel execution and will probably
> scale much better. It also does not have the power-of-2 core limitation that
> sander has. So try this. Note, though, that it too will be choked by the
> fact that the interconnect is not balanced for the number of cores in a
> node, so you may need to leave cores idle in order to obtain the best
> performance. E.g. try using just 16 or 32 cores per node.
>
> In summary, use PMEMD if you can, and unfortunately you cannot expect
> miracles if you build hopelessly unbalanced machines. You should also run
> some benchmarks for each of the simulations you wish to run, since parallel
> performance will be VERY dependent on the simulation in question. Larger
> simulations typically scale better. Finally, consider leaving a bunch of
> cores idle on each of your machines; you might get better overall
> performance. E.g. 4 nodes by 32 cores per node (128 cores) will probably
> outperform 4 nodes by 48 cores per node (192 cores).
>
> Good luck,
> Ross
>
>> -----Original Message-----
>> From: Jyh-Shyong [mailto:jyhshyong0.gmail.com]
>> Sent: Thursday, July 28, 2011 5:00 PM
>> To: amber.ambermd.org
>> Subject: [AMBER] AMBER/sander parallel performance problem
>>
>> Dear Amber users,
>>
>> I just installed Amber11 on our new cluster computer, and ran some test
>> cases on it.
>> Each node has 48 cores, and all nodes are connected with QDR
>> infiniband.
>>
>> I built the parallel version of sander with both mvapich2-1.5 and
>> openmpi-1.4.3. The performance of the program is quite strange:
>>
>> Here is a case using 1 core; it took about 1 hr and 24 min:
>> | Job began at 10:00:53.602 on 07/27/2011
>> | Setup done at 10:00:54.844 on 07/27/2011
>> | Run done at 11:23:48.604 on 07/27/2011
>>
>> And here is the case using 48 cores (one computing node); it took about
>> 9 min 40 s:
>>
>> | Job began at 20:32:36.572 on 07/28/2011
>> | Setup done at 20:32:38.436 on 07/28/2011
>> | Run done at 20:41:17.902 on 07/28/2011
>>
>> It is nice.
>>
>> However, the case using 192 cores (4 computing nodes) took about 12
>> minutes!
>> | Job began at 20:20:14.506 on 07/28/2011
>> | Setup done at 20:20:17.208 on 07/28/2011
>> | Run done at 20:31:26.587 on 07/28/2011
>>
>> Something is wrong when using more than one computing node. I followed
>> the installation guide and compiled the program with both the Intel and
>> GCC compilers and the MKL library, and I always got similar results.
>>
>> Any hint on how this could happen and how to fix the problem?
>>
>> Thanks.
>>
>> Jyh-Shyong Ho, Ph.D.
>> Research Scientist
>> National Center for High Performance Computing
>> Hsinchu, Taiwan, ROC
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 28 2011 - 19:30:03 PDT