Re: AMBER: amber8 parallel sander from Robert Duke on 2005-01-17 (Amber Archive Jan 2005)

From: Robert Duke <rduke.email.unc.edu>
Date: Mon, 17 Jan 2005 13:13:10 -0500

Yen -
Some wild guesses here. Less than 100% cpu utilization indicates that you DO have network problems (you are blocked waiting on the network instead of doing calcs). Also, seeing high utilization in poll() probably indicates that you are wasting a lot of your cpu time spin-locking in the poll, waiting for network communications to occur. A good diagnostic is the timing data in the logfile (pmemd) or the profile_mpi file. This will give you some indication of how much time is being spent in communications.

I would GUESS that you have one or all of the following going on: 1) your system tcp/ip buffer size is small, 2) the network cards are slow, and possibly not on a fast system bus, 3) the ethernet switch is not operating at full duplex, and/or is basically slow. Just because you don't appear to be pushing the net to a full Gbit/sec (actually with full duplex, you push it this fast both ways) does not mean that you don't have a net problem. The problem may well be that your hardware, as configured, is not truly capable of attaining 2 Gbit/sec throughput (ie., 1 Gbit/sec each way), or it may just need a different configuration.

IF you are seeing 100% cpu utilization, and it is not in poll() and other network calls, and the metric for FRC_collect time in profile_mpi (or FXdist in the logfile of pmemd) is small, then your network is doing okay.
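One way to put a number on the poll() observation is to tally a captured system call trace. The sketch below is in Python and assumes strace-style lines of the form `name(args) = result`; truss output on Solaris is laid out differently and would need a different pattern. The function name `tally_syscalls` and the sample lines are made up for illustration and are not taken from a real run.

```python
import re
from collections import Counter

# Assumption: one completed system call per line, strace style,
# e.g. "poll([...], 1, 0) = 0 (Timeout)".
CALL_RE = re.compile(r"^(?P<name>\w+)\(.*\)\s*=\s*(?P<result>.+)$")


def tally_syscalls(lines):
    """Count calls by name and classify the outcome of each poll()."""
    calls = Counter()
    poll_outcomes = Counter()
    for line in lines:
        match = CALL_RE.match(line.strip())
        if match is None:
            continue  # skip signals, blank lines, unfinished calls
        name = match.group("name")
        calls[name] += 1
        if name != "poll":
            continue
        result = match.group("result")
        if "EAGAIN" in result:
            poll_outcomes["EAGAIN"] += 1
        elif result.split()[0] == "0":
            poll_outcomes["timeout"] += 1  # returned with nothing ready
        else:
            poll_outcomes["ready"] += 1
    return calls, poll_outcomes


# Illustrative lines only, not captured from a real job.
sample = [
    "poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)",
    "poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)",
    "poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)",
    "poll([{fd=5, events=POLLIN}], 1, 0) = 0 (Timeout)",
    "poll([{fd=5, events=POLLIN}], 1, 0) = -1 EAGAIN (Resource temporarily unavailable)",
    "poll([{fd=5, events=POLLIN}], 1, 0) = 1 ([{fd=5, revents=POLLIN}])",
    'read(5, "abc", 4096) = 3',
]

calls, poll_outcomes = tally_syscalls(sample)
empty = poll_outcomes["timeout"] / calls["poll"]
print(f"poll calls: {calls['poll']}, returned with nothing ready: {empty:.0%}")
```

A high share of polls that return with nothing ready is consistent with the spin-waiting described above, but it does not by itself say which part of the network path is slow.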
Regards - Bob
  ----- Original Message -----
  From: yen li
  To: amber.scripps.edu
  Sent: Monday, January 17, 2005 12:55 PM
  Subject: Re: AMBER: amber8 parallel sander

  Hi Robert & Carlos,
  Thanks for the elaborate replies.

  It's true that we are sharing the switched GB ethernet for other
  purposes also. To check whether the network is the choke point, we collected
  network statistics while the benchmarking was running for 16 processors.
  We find that the network utilization is never more than 10% for any of
  the hosts and the link is up at 1GB/s. While doing this we noticed
  unusually high system usage % like +50%. To find out the cause, we
  collected the system calls being generated by one process of the 16 processes
  ("$> truss -p pid"). The results show that it's mostly system call
  "poll" which returns 0(like +80% of the time) and error EAGAIN(like 15% of
  the time).

  For linux the equivalent command of truss would be "$>strace -p pid".
  Can someone please suggest any way to improve the performance.

  Best Regards,
  Yen

> Robert Duke <rduke.email.unc.edu> wrote:
> Yen -
> Once again I will chime in after Carlos :-) Especially with GB ethernet,
> "your mileage may vary" as you attempt to scale up to higher processor
> count. The performance gain can quickly be inconsequential, and you can
> even lose ground due to all the parallel overhead issues. Are you running
> sander or pmemd? PMEMD is faster in these scenarios and takes less memory,
> but it only does a subset of the things that sander does, and still cannot
> overcome all the inadequacies of GB ethernet. So what are the major factors
> that determine how well things work? In a nutshell, they are:
> 1) Problem size. As you have more atoms, it takes more memory on each node
> for the program AND for MPI communications. I cannot get sander 8 to run at
> all on 4 pentium nodes/GB ethernet/mpich for the rt polymerase problem which
> is ~140K atoms. The atoms are constrained which actually makes things worse
> in some regards (extra data). It runs fine for sander 8 on 1 and 2
> processors. It runs fine for pmemd on 1, 2, and 4 pentium/GB nodes. The
> really important problem in all this is the distributed fft transpose for
> reciprocal space calcs. That quickly swamps a GB interconnect for large
> problem size.
> 2) SMP contribution. An aspect of pentium GB ethernet setups that is often
> not emphasized is that dual pentium cpu's (they share memory) are often
> used. Now this is bad from a standpoint of memory cache utilization,
> because each processor has less cache to work with, and depending on the
> hardware there are cache coherency issues. But from an interconnect
> standpoint it is great because mpi running over shared memory is
> significantly faster than mpi over GB ethernet. So if you have a collection
> of uniprocessors connected via GB ethernet, I would not expect much.
> 3) Network hardware configuration. Are the network interface cards on the
> ultrasparcs server-grade; ie. capable of running full bandwidth full duplex
> GB ethernet without requiring significant cycles from the cpu? If not, then
> things won't go as well. Server nics typically cost more than twice as
> much as workstation grade nics. How about the ethernet switch? Cheap ones
> DO NOT work well at all, and you will have to pay big bucks for a good one
> (folks can chime in with suggestions here; I run two dual pentiums connected
> with a XO cable so there is no switch overhead). Cables? Well, think CAT
> 5e or 6, I believe. This is getting to be more commonly available; just be
> sure your cables are rated for 1 Gbit/sec and not 100 Mbit/sec.
> 4) MPI configuration. I don't mess with suns, so I have no idea if they
> have their own mpi, or if you are running mpich or whatever. If you are
> running mpich, there are ways to screw up the s/w configuration, not
> allowing for shared memory use, not giving mpich enough memory to work with,
> etc. There are some notes on the amber web site, probably by me, Victor
> Hornak, et al.
> 5) System configuration. For linux, there are networking config issues
> controlling how much buffer space is set aside for tcp/ip communications.
> This can have a big effect on how fast communications actually are, and
> typically you sacrifice system memory for speed. See the amber web page for
> notes from me relevant to linux; once again I have no idea what is required
> for sun hw/sw.
> 6) Other network loading issues. Is the GB ethernet used a dedicated GB
> ethernet, with no traffic for internet, other machines, NFS, etc., etc.? Is
> anyone else using other nodes at the same time (ie., perhaps several mpi
> jobs running over the same GB ethernet)? If there is any other network load
> whatsoever, your performance will be worse, and it may be substantially
> worse.
>
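As a quick first look at the buffer question in item 5, the sketch below asks the operating system what send and receive buffer sizes a newly created TCP socket gets by default. It is in Python and uses only the standard `socket` module; the helper name `default_tcp_buffer_sizes` is made up for illustration. It shows per-socket defaults only, not the system-wide limits, and an MPI library may well set its own sizes, so the notes on the amber web site remain the place to look for actual tuning advice.

```python
import socket


def default_tcp_buffer_sizes():
    """Return the (send, receive) buffer sizes in bytes of a fresh TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        send_bytes = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        recv_bytes = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return send_bytes, recv_bytes


if __name__ == "__main__":
    send_bytes, recv_bytes = default_tcp_buffer_sizes()
    print(f"default send buffer:    {send_bytes} bytes")
    print(f"default receive buffer: {recv_bytes} bytes")
```

Running it on each node is a cheap way to spot a machine that is configured differently from the others.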
> What is the best you can expect? Well, my latest performance work for pmemd
> (not yet released version) yields the following throughput for my two dual
> pentiums (3.2 GHz), XO GB-connected, running 90,906 atoms, constant
> pressure, particle mesh ewald:
>
> # proc psec/day
>
> 1 95
> 2 155
> 4 259
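The scaling in the table above can be made explicit by converting throughput into speedup and parallel efficiency relative to the smallest processor count. This is a small Python sketch; the function name `scaling_report` is made up for illustration, and the numbers are the ones quoted in the table.

```python
def scaling_report(throughput):
    """Map processor count -> (speedup, efficiency) relative to the smallest count.

    throughput maps processor count to psec/day.
    """
    base_procs = min(throughput)
    base_rate = throughput[base_procs]
    report = {}
    for procs in sorted(throughput):
        speedup = throughput[procs] / base_rate
        # Efficiency is the achieved speedup divided by the ideal (linear) one.
        efficiency = speedup / (procs / base_procs)
        report[procs] = (speedup, efficiency)
    return report


# psec/day from the table above (unreleased pmemd, two dual pentiums).
pmemd = {1: 95, 2: 155, 4: 259}

for procs, (speedup, efficiency) in scaling_report(pmemd).items():
    print(f"{procs} procs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

For these numbers the efficiency comes out at roughly 82% on 2 processors and 68% on 4.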
>
> Now, you will note that the scaling is not great, and this is about as good
> as it gets for this kind of hardware. This IS a large problem (91K atoms),
> and you should do significantly better on scaling if your problem is
> smaller. By the way, comparison numbers for 1 and 2 procs, this same
> hardware, pmemd 8 and sander 8 are:
>
> # proc   psec/day, pmemd 8   psec/day, sander 8
>
> 1        76                  54.5
> 2        121                 88
>
> Now I don't have any data for 8 and 16 processors here, simply because I no
> longer have access to that type of hardware in reasonable condition. A
> while back I did runs on 2.4 GHz blades at UNC for pmemd 3.0-3.1, and was
> seeing numbers like this (abstracted from pmemd 3.0 release notes; there is
> lots of good info about performance on various systems in the various pmemd
> release notes, available on the amber web site). NOTE that I had exclusive
> access to the blade cluster (no other jobs running on it), and the GB
> ethernet was dedicated to mpi, not shared for internet, etc.:
>
> *******************************************************************************
> LINUX CLUSTER PERFORMANCE, IBM BLADE XEON 2.4 GHZ, GIGABIT ETHERNET
> *******************************************************************************
> The mpi version was MPICH-1.2.5. Both PMEMD and
> Sander 7 were built using the Intel Fortran Compiler.
>
> 90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
> #procs    PMEMD      Sander 6    Sander 7
>           psec/day   psec/day    psec/day
>    2         85         46          59
>    4        153         71          99
>    6        215         ND          ND
>    8        272        114         154
>   10        297         ND          ND
>   12        326        122          ND
>   14        338         ND          ND
>   16        379        127         183
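Another way to read the PMEMD column above is to ask how much throughput each added processor buys at each step. The Python sketch below computes that marginal gain; the function name `marginal_gains` is made up for illustration, and the data are the PMEMD figures from the table.

```python
def marginal_gains(throughput):
    """Return (from_procs, to_procs, psec/day gained per added processor)."""
    procs = sorted(throughput)
    gains = []
    for prev, curr in zip(procs, procs[1:]):
        gain = (throughput[curr] - throughput[prev]) / (curr - prev)
        gains.append((prev, curr, gain))
    return gains


# PMEMD psec/day on the blade cluster, from the table above.
pmemd_blades = {2: 85, 4: 153, 6: 215, 8: 272, 10: 297, 12: 326, 14: 338, 16: 379}

for prev, curr, gain in marginal_gains(pmemd_blades):
    print(f"{prev:2d} -> {curr:2d} procs: {gain:5.1f} psec/day per added processor")
```

Going from 2 to 4 processors adds 34 psec/day per processor, while going from 12 to 14 adds only 6, which is the diminishing return on gigabit ethernet that the rest of this message describes.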
>
> There are reasons that people spend those large piles of money for
> supercomputers. PMEMD runs with pretty close to linear scaling on hardware
> with a fast interconnect out to somewhere in the range of 32-64 processes,
> and is usable (greater than 50% scaling) at 128 processors and beyond. I
> can get close to 7 nsec/day for the problem above on that sort of hardware
> (once again, unreleased version of pmemd, but you will see it in the
> future).
>
> Regards - Bob Duke
>
>
> ----- Original Message -----
> From: yen li
> To: amber.scripps.edu
> Sent: Wednesday, January 12, 2005 8:54 AM
> Subject: Re: AMBER: amber8 parallel sander
>
>
> Hi,
> Thanks Robert & Carlos for the clarifications.
>
> I have one more related doubt. I also timed the same simulations for the 4
> cases: namely 1, 4, 8 & 16 processors. I find that it's the fastest for 4
> and slower for 1, 8 & 16. I can understand for 1 but cannot understand it
> getting slower for increased number of processors.
>
> All the processors are of the same make(Sun UltraSparc III+), same
> OS(Solaris 8), same amount of RAM(1GB each) and connected over 1GBps
> network.
>
> Thanks
>
>
> Robert Duke wrote:
> Yen -
> As Carlos says, this is expected. The reason is that when you parallelize
> the job, the billions of calculations done occur in different orders, and
> this introduces different rounding errors. With pmemd, you will even see
> differences with the same number of processors, and this is because there is
> dynamic load balancing of direct force workload (ie., if one processor is
> taking more time to do the direct force calcs, it will be assigned fewer
> atoms to work on). You have to remember that the internal numbers used have
> way more precision than is justified by our knowledge of parameters, or for
> that matter how well the method represents reality, and that any one run
> represents one of many possible trajectories.
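The rounding effect described above is easy to reproduce on a tiny scale: floating-point addition is not associative, so summing the same numbers in a different order can give a different result. The Python sketch below is a toy illustration of that general point, not of how sander or pmemd accumulate forces; the helper name `sum_in_order` is made up.

```python
def sum_in_order(values):
    """Add the values one at a time, left to right, in double precision."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, two summation orders.
terms = [0.1, 0.2, 0.3]
forward = sum_in_order(terms)
backward = sum_in_order(reversed(terms))
print(forward == backward)       # False
print(abs(forward - backward))   # a difference in the last bit

# A more drastic case: the small term is lost when added to the large one first.
print(sum_in_order([1e16, 1.0, -1e16]))   # 0.0
print(sum_in_order([1e16, -1e16, 1.0]))   # 1.0
```

In a molecular dynamics run such last-bit differences feed into every following step, so two runs that split the work differently start from the same point and then drift apart.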
> Regards - Bob Duke
> ----- Original Message -----
> From: yen li
> To: amber.scripps.edu
> Sent: Wednesday, January 12, 2005 5:57 AM
> Subject: AMBER: amber8 parallel sander
>
>
> Hello amber
> I am testing the parallel version of amber8. I am running an md simulation
> over a small protein.
> I am testing the calculations on four, eight and sixteen processors. My
> problem is that, initially the energy values in the output files are the
> same, but as the simulation proceeds, the values start to diverge, making the
> differences large. Is this kind of behaviour ok or do I need to take care of
> some parameters?
> Thanks
>
>
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Mon Jan 17 2005 - 18:53:01 PST