Re: AMBER: PMEMD and sander from AMBER6 performances

From: Robert Duke <rduke.email.unc.edu>
Date: Wed, 16 Jul 2003 23:11:55 -0400

Teletchea -

There are of course several things going on here.

First of all, the overall performance you observe for the benchmark is less
than what was published in the PMEMD release note because you are not
running exactly the same benchmark. If you look at the release note, it
specifies that the JAC benchmark is run, with the exception that default
values are used for the "cut" (cutoff) and "skinnb" (pairs list skin cutoff)
parameters. Thus, when running sander 6 or PMEMD in sander 6 mode, values
of cut = 8 and skinnb = 1.5 are used, not the values of 9 and 2 that are
part of the standard JAC benchmark. The use of a cutoff of 8 rather than 9
has a big impact on performance, and the use of a skinnb of 1.5 rather than
2.0 helps PMEMD by something like an additional 5%. Overall, the difference
in performance expected from simply changing these two parameters to the
default is on the order of 33% (at least on my Athlon uniprocessor). So
that accounts for the bulk of the performance differential you see.
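
To put numbers on it, the only input differences between the two runs are:

  standard JAC input:       cut=9.0 (in &cntrl),  skinnb=2.0 (in &ewald)
  release note benchmark:   cut=8.0 (in &cntrl),  skinnb=1.5 (in &ewald)

As a rough back-of-the-envelope check, scaling the published 2.4GHz xeon PMEMD
number down by that ~33% and up by your clock ratio gives about
230 / 1.33 * (2.8 / 2.4) = 202 ps/day, which is right in line with the 209
ps/day you measured (clock scaling is only approximate, of course).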

Regarding scalability, I would expect a gigabit ethernet cluster to be
impacted by:
1) the overall load on the cluster interconnect.
2) system network configuration issues.
3) particulars of the networking hardware (nic's, switches, and cables).
4) particulars of the cpu's and motherboards.
5) particulars about the disk used by the master node.

1) If there was any other load at all on the cluster network, that would
cause you to observe worse scalability than what I observed. My tests on
gigabit ethernet were run with nothing else at all running on the cluster.
An extraneous load (another job running) could easily cut throughput by
10-20%. Also, the Athlon numbers I published are with Myrinet, which is a
better switch than Gigabit ethernet, with better scalability (so you can't
compare). Incidentally, I think there are plans to upgrade our gigabit
stuff to Myrinet. I am not in love with either solution, but Myrinet
definitely scales better (my numbers on Myrinet with low numbers of nodes
don't look that much better, but what you don't see is that about 40-60
other cpu's are using the interconnect).

2) There may be some subtle network configuration issues. I will forward
this mail to the UNC-CH systems folks that support our cluster. They did
some tweaking that definitely improved the performance (I had some pre-tweak
numbers that were worse). It may have been something as simple as
increasing the tcp/ip buffer sizes - I don't know, but will ask, and forward
any info I get.
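
Purely as a guess at what that kind of tweak looks like (I don't know that this
is what was actually changed on our cluster), TCP buffer tuning on Linux is
usually done with sysctl settings along these lines:

  # /etc/sysctl.conf - illustrative values only
  net.core.rmem_max = 8388608
  net.core.wmem_max = 8388608
  net.ipv4.tcp_rmem = 4096 87380 8388608
  net.ipv4.tcp_wmem = 4096 65536 8388608

i.e., raising the maximum socket buffer sizes so the MPI traffic over gigabit
can keep more data in flight.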

3) Not all nic's (network interface cards) and gb ethernet switches are
created equal, I would expect (I know this to be true of nic's for sure). I
also expect that PCI bus implementation differences will have an impact.

4) Different motherboard/cpu combinations may exhibit different performance
characteristics, especially in dual processor configurations. Because gigabit
ethernet bandwidth is so limited, dual processor shared memory performance can
have a big impact (on the dual nodes, shared memory is used for communication
between the two cpus instead of going through the switch).
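
One related thing worth checking, since you built mpich 1.2.5 yourself: make
sure it was configured to use shared memory for intra-node communication. If I
remember right, the ch_p4 device only does that if you ask for it at configure
time, something like

  ./configure --with-device=ch_p4 -comm=shared

(the exact option is from memory, so check the mpich docs). Without it, the two
cpus of a dual node talk to each other over TCP through the switch.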

5) In general, I don't see PMEMD getting io-bound by disk i/o unless you
write coords/vels/restarts frequently and are running on lots of nodes. If
your nfs storage is really slow, I suppose it could be an issue, though. I
did some optimizations to limit the impact of writing mdout and mdinfo.
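
For the JAC input you quoted below, the relevant &cntrl output settings are

  ntpr=50,       (mdout/mdinfo written every 50 steps)
  ntwr=10000,    (restart written only at the end of the 1000-step run)

and there is no ntwx/ntwv, so no coordinates or velocities are being dumped at
all. With settings like that I would not expect nfs to be the bottleneck;
bumping ntpr up (say to 500) is a quick way to rule it out.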

I was interested to see that you actually did a mixed cpu cluster run.
PMEMD does dynamic load balancing, so it can take advantage of such
configurations without being dragged down to the speed of the slowest node.
I would expect DSLOW_NONBLOCKING_MPI to not have much effect by the time you
are running on 14 nodes, but I find it interesting that you got slightly
better performance without it on 14 mixed nodes.
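
Coming back to the load balancing point, just to illustrate the general idea
(this is a toy sketch, not PMEMD's actual scheme): the work gets re-divided
each cycle in proportion to how fast each node actually ran, so a slower node
simply gets a smaller share instead of stalling everybody. A small Python
sketch:

# Toy sketch of proportional re-balancing; NOT PMEMD's real algorithm.
def rebalance(atom_counts, step_times):
    """Given how many atoms each node owned and how long its last cycle took,
    return new counts proportional to each node's observed speed."""
    total_atoms = sum(atom_counts)
    speeds = [n / t for n, t in zip(atom_counts, step_times)]  # atoms per second
    total_speed = sum(speeds)
    # Rounding may leave the total off by an atom or two; fine for a sketch.
    return [round(total_atoms * s / total_speed) for s in speeds]

# Example: four equal nodes plus one node running at half speed.
counts = [4712, 4712, 4712, 4711, 4711]   # 23558 atoms, DHFR/JAC size
times  = [1.0, 1.0, 1.0, 1.0, 2.0]        # seconds per cycle
print(rebalance(counts, times))           # the slow node ends up with about half a share

On a mixed athlon/xeon cluster that is exactly what keeps the athlons from
dragging the xeons down to their pace.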

I'll try to get you the system config info. Let me know if you need any
more info.

Regards - Bob Duke

----- Original Message -----
From: "Teletchéa Stéphane" <steletch.biomedicale.univ-paris5.fr>
To: <amber.scripps.edu>
Sent: Wednesday, July 16, 2003 7:37 PM
Subject: AMBER: PMEMD and sander from AMBER6 performances


> Hi !
>
> I've been improving our cluster performance by adding more nodes, and
> fortunately last week I got a copy of PMEMD, which actually multiplied the
> performance by about 1.75x!
>
> Nice, but I'm not able to get the same numbers...
>
> --------------------------------------------------------------------
>
> First, I must say that amber6 was compiled with the g77-2.96 from RH7.1 and
> mpich-1.2.5, while PMEMD was compiled with the latest icc/ifc7 from Intel,
> as instructed in the PMEMD documentation.
>
> I would be very pleased if you could explain why I am not able to reach the
> same performance on what seem to be the same configurations.
>
> --------------------------------------------------------------------
> Taking this into consideration, I get, for example, for the IBM blade Xeon
> 2.4GHz/gigabit (the closest to my system):
> JAC benchmark :   sander6       pmemd
> From PMEMD :      130 ps/day    230 ps/day   (2.4GHz Xeons from IBM)
> Mine :            110 ps/day    209 ps/day   (2.8GHz Xeons from Alineos)
>
> Now, my Xeon is 2.8GHz, so I should get 130*2.8/2.4 = 152 ps/day or
> 230*2.8/2.4 = 268 ps/day roughly speaking, but not LESS than the 2.4GHz
> machine!
>
> Any explanation for this 30-40% drop?
>
> The same for the Athlon :                 sander6       pmemd
> From MICRONPC 1.6GHz Athlon :             62.6 ps/day   122 ps/day
> My Athlon (one cpu of the dual 1.2GHz) :  37.6 ps/day   67 ps/day
>
> Again, I should get 62.6*1.2/1.6 = 47 ps/day or 122*1.2/1.6 = 91 ps/day.
>
> Any explanation for this 25-35% drop?
>
> The performance increase between sander6 and PMEMD is as described (from
> 1.78x to 2.20x faster between the two!).
>
> Scalability is poor on my system compared to what is published.
>
> Any hint ?
>
> Maybe the nfs homes?
> I'm using PBS to handle the jobs; I've also tried launching locally (on the
> node), but I get the same results.
>
> All the needed parameters are (hopefully) below.
>
> I've installed src.pmemd in the amber6 tree as indicated; did I miss a
> step?
>
> Sincerely yours,
> Stéphane TELETCHEA
>
> --------------------------------------------------------------------
>
> The cluster is gigabit-linked (with its own switch); home directories are
> mounted on each node over a separate NFS fast ethernet network (with its
> own switch).
>
> There are 2*4 Athlons at 1.2GHz and 2*3 Xeons at 2.8GHz (i.e., 4 dual-Athlon
> nodes and 3 dual-Xeon nodes, 14 cpus in all), controlled by one master
> (1.2GHz AMD).
>
> Here are my numbers from the JAC benchmark (input file at the bottom of
> the mail) downloaded directly from ftp :
>
> -----------------------------------------------------
> -----------------------------------------------------
> Relative performance analysis of sander6 vs pmemd
> System : DHFR, also known as JAC
> 23558 atoms - 7182 molecules - Box : 64x64x64 Ang.
> 1000 steps of dynamics run - throughput is in ps/day.
> -----------------------------------------------------
> Note that this benchmark uses a 1 fs timestep, so
> each 1000-step run covers 1 ps of trajectory.
> ----------------------------------------------------------------------
> | Processor(s) | Clock  | SANDER6      | PMEMD*       | PMEMD*/sander6 |
> ----------------------------------------------------------------------
> | 1 athlon     | 1.2GHz | 37.6 (1x)    | 0 (est. 67)  | 1.78x          |
> | 2 athlons    | 1.2GHz | 68.6 (1.82x) | 122 (1.82x)  | 1.78x          |
> | 4 athlons    | 1.2GHz | 118 (3.14x)  | 216 (3.22x)  | 1.83x          |
> | 6 athlons    | 1.2GHz | 153 (4.07x)  | 299 (4.46x)  | 1.95x          |
> | 8 athlons    | 1.2GHz | 189 (5.03x)  | 365 (5.44x)  | 1.93x          |
> --------------------------------[ PMEMD_p4 ]--------------------------
> | 1 xeon       | 2.8GHz | 63.7 (1x)    | 0 (est. 115) | 1.80x          |
> | 2 xeons      | 2.8GHz | 110 (1.73x)  | 209 (1.82x)  | 1.90x          |
> | 4 xeons      | 2.8GHz | 176 (2.76x)  | 348 (3.03x)  | 1.98x          |
> | 6 xeons      | 2.8GHz | 214 (3.36x)  | 470 (4.09x)  | 2.20x          |
> ----------------------------------------------------------------------
>
> For the whole cluster (no PMEMD_p4) :
>
> | Processor(s)     | SANDER6 | PMEMD* | PMEMD  | PMEMD/sander6 |
> ----------------------------------------------------------------
> | 14 processors    | 280     | 649    | 700    | 2.32x / 2.5x  |
> | speedup/1 athlon | 7.45x   | 9.69x  | 10.45x |               |
> | speedup/1 xeon   | 4.40x   | 5.64x  | 6.09x  |               |
> ----------------------------------------------------------------
> PMEMD* indicates PMEMD has been compiled with the
> option -DSLOW_NONBLOCKING_MPI
>
> PMEMD_p4 indicates PMEMD has been compiled specifically for
> taking advantage of P4 instructions.
>
> PMEMD indicates PMEMD has been compiled WITHOUT the
> option -DSLOW_NONBLOCKING_MPI
>
> -----------------------------------------------------
> -----------------------------------------------------
>
> An AMD dual MP2800+ is about 5% slower than a dual Xeon 2.8GHz with
> Intel's compiler:
>
>
>
> [root@master0 bin]# icid
> OS information:
> Red Hat Linux release 7.1 (Seawolf)
> Kernel 2.4.20 on an i686
> glibc-2.2.4-19
>
> ===========================================================
> Support Package IDs for Intel(R) Compilers in
> /opt/intel/compiler70/ia32/bin
> Please use the following information when submitting customer support
> requests.
>
> C++ Support Package ID : l_cc_p_7.1.006-NCOM
> Fortran Support Package ID: l_fc_p_7.1.008-NCOM
> ===========================================================
> C++ & Fortran License Expiration Date: never expire
> C++ & Fortran Support Services Expiration Date: never expire
>
> All Installed Compiler Components on this OS:
> intel-isubh7-7.1-6: Substitute Headers for Intel(R) C++ Compiler for
> 32-bit applications, Version 7.1
> intel-ifc7-7.1-8: Intel(R) Fortran Compiler for 32-bit applications,
> Version 7.1 Build 20030307Z
> intel-icc7-7.1-6: Intel(R) C++ Compiler for 32-bit applications, Version
> 7.1 Build 20030307Z
>
>
>
> ------------------------------------------------------------------
>
> The input file for JAC; I've just changed the number of steps.
>
> [stephane@master0 DM_Tcte300_H2O]$ more dn300K
> short md, nve ensemble
> &cntrl
> ntx=7, irest=1,
> ntc=2, ntf=2, tol=0.0000001,
> nstlim=1000,ntcm=1,nscm=1000,
> ntpr=50, ntwr=10000,
> dt=0.001, vlimit=10.0,
> cut=9.,
> ntt=0, temp0=300.,
> &end
> &ewald
> a=62.23, b=62.23, c=62.23,
> nfft1=64,nfft2=64,nfft3=64,
> skinnb=2.,
> &end
>
> --
> *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*
> Teletchéa Stéphane - CNRS UMR 8601
> Lab. de chimie et biochimie pharmacologiques et toxicologiques
> 45 rue des Saints-Pères 75270 Paris cedex 06
> tél : (33) - 1 42 86 20 86 - fax : (33) - 1 42 86 83 87
> mél : steletch.biomedicale.univ-paris5.fr
> *~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*
>
>
>
>



-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Thu Jul 17 2003 - 04:53:01 PDT