Re: AMBER: Sander slower on 16 processors than 8 from Martin Stennett on 2007-02-22 (Amber Archive Feb 2007)

From: Martin Stennett <martin.stennett.postgrad.manchester.ac.uk>
Date: Thu, 22 Feb 2007 21:10:40 -0000

In my experience Sander slows dramatically with even two processors. The message-passing interface it uses means that it frequently drives itself into bottlenecks, with one or more processors waiting for very long periods for others to finish.
It also passes an extraordinary amount of data between processes, though with your setup this shouldn't be as much of a factor as it was on my test system.
To me it seems that AMBER is great from the point of view of a chemist, and very accessible should one want to change it. From a computational point of view, though, it needs a bit of optimisation and tweaking before it should be considered a serious solution.
Martin
  ----- Original Message -----
  From: Sontum, Steve
  To: amber.scripps.edu
  Sent: Thursday, February 22, 2007 8:32 PM
  Subject: AMBER: Sander slower on 16 processors than 8

  I have been trying to get decent scaling for amber calculations on our cluster and keep running into bottlenecks. Any suggestions would be appreciated. The following are benchmarks for factor_ix and jac on 1-16 processors using amber8 compiled with pgi 6.0, except for the lam runs, which used pgi 6.2.



  BENCHMARKS

  MPI library      benchmark    1     2     4     8     16
  mpich1 (1.2.7)   factor_ix    928   518   318   240   442
  mpich2 (1.0.5)   factor_ix    938   506   262   *
  mpich1 (1.2.7)   jac          560   302   161   121   193
  mpich2 (1.0.5)   jac          554   294   151   111   181
  lam (7.1.2)      jac          516   264   142   118   259



  * timed out after 3 hours
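
To put numbers on the scaling, the timings above can be turned into speedup and parallel efficiency relative to the single-processor run. The following is only a rough sketch: it assumes the figures are wall-clock times in a common unit (the post does not state one), and the names `timings` and `scaling` are made up for this illustration. The mpich2 factor_ix run that timed out at 8 processors is simply left out.

```python
# Timings copied from the benchmark table above (unit as reported; lower is better).
# Keys are (MPI library, benchmark); inner keys are the number of processors.
timings = {
    ("mpich1 1.2.7", "factor_ix"): {1: 928, 2: 518, 4: 318, 8: 240, 16: 442},
    ("mpich2 1.0.5", "factor_ix"): {1: 938, 2: 506, 4: 262},  # 8-processor run timed out
    ("mpich1 1.2.7", "jac"): {1: 560, 2: 302, 4: 161, 8: 121, 16: 193},
    ("mpich2 1.0.5", "jac"): {1: 554, 2: 294, 4: 151, 8: 111, 16: 181},
    ("lam 7.1.2", "jac"): {1: 516, 2: 264, 4: 142, 8: 118, 16: 259},
}


def scaling(times):
    """Return {n: (speedup, efficiency)} relative to the 1-processor time."""
    base = times[1]
    return {n: (base / t, base / (t * n)) for n, t in sorted(times.items())}


for (mpi, bench), times in timings.items():
    print(f"{mpi} {bench}")
    for n, (speedup, efficiency) in scaling(times).items():
        print(f"  {n:2d} procs: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

On these figures every complete series peaks at 8 processors and then loses speedup at 16, which is the behaviour the questions below ask about.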

  QUESTIONS

  First off, is it unusual for the calculation to get slower with increased number of processes?

  Does anyone have benchmarks for a similar cluster, so I can tell if there is a problem with the configuration of our cluster? I would like to be able to run on more than one or two nodes.



  SYSTEM CONFIGURATION

  The 10 compute nodes use 2.0GHz dual-core Opteron 270 chips with 4GB memory and 1MB cache, Tyan 2881 motherboards, an HP ProCurve 2848 switch, and a single 1Gb/sec Ethernet connection to each motherboard. The master node is configured similarly but also has 2TB of RAID storage that is automounted by the compute nodes. We are running SuSE 2.6.5-7-276-smp for the operating system. Amber8 and mpich were compiled with pgi 6.0.



  I have used ganglia to look at the nodes when a 16 process job is running. The nodes are fully consumed by system CPU time. The user CPU time is only 5%, and this node is only pushing 1.4 kBytes/sec out over the network.
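
The user/system split reported by ganglia can be cross-checked directly on a compute node. The sketch below assumes a Linux node where the first line of /proc/stat carries the aggregate CPU counters (user, nice, system, idle, ...); the function names are hypothetical, and this is not how ganglia itself collects the numbers.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # Keep at most user, nice, system, idle, iowait, irq, softirq, steal.
    # Later fields (guest time) are already counted inside user time.
    return [int(x) for x in fields[1:9]]


def cpu_shares(before, after):
    """Return (user %, system %) of CPU time spent between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0, 0.0
    user = deltas[0] + deltas[1]  # user + nice
    system = deltas[2]            # time spent in the kernel
    return 100.0 * user / total, 100.0 * system / total


def sample(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return (user %, system %)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_shares(first, second)
```

Calling `sample()` on a node while the 16-process job is running returns the two percentages for that interval; a result dominated by system time would match what ganglia shows.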

  Steve

  ------------------------------

  Stephen F. Sontum
  Professor of Chemistry and Biochemistry
  email: sontum.middlebury.edu
  phone: 802-443-5445
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun Feb 25 2007 - 06:07:23 PST