Hi Yuann,
> PS: Each nodes communicates with each other by one GbE switch (3COM 2924-
> SFPplus)
To follow up on what Bob Duke said, the problem is the gigabit interconnect.
Essentially, in this day and age of multiple cores inside a node, you cannot
use ethernet to run MD in parallel. At least not regular MD. Things like
thermodynamic integration should probably work okay, as long as you make
sure the core mapping in your machine file is such that you run 16 threads,
but in the form of 8 threads for each image with those 8 threads all
residing on the same node (see the sketch below). The same is true for
things like replica exchange. The only real solution for running MD in
parallel across multiple nodes is a 'real' interconnect such as InfiniBand
or Myrinet: something designed to do MPI (in hardware) as opposed to
wrapping it up in tiny TCP/IP packets and sending it across the equivalent
of the internet.
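As a rough sketch of the core mapping I mean, here is a machine file for two
hypothetical nodes (node01 and node02 are made-up hostnames, and the exact
host:slots syntax and mpirun/mpiexec options depend on your MPI flavour;
this is the mpich2 style):

  # hypothetical machine file: 8 slots per node, so the 8 threads
  # belonging to one image never leave a single box
  node01:8
  node02:8

With a layout like this MPI should place the first 8 ranks on node01 and the
next 8 on node02, but implementations differ in how they fill slots, so it
is worth confirming the placement with top on each node before trusting a
long run.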
Remember, gigabit ethernet first came out in the days of the Pentium II 300.
At that time (ignoring latency and lots of other issues) the bandwidth to
CPU speed ratio was 1000/300 = 3.3. Consider the situation now. You have
2 x quad core (ignoring all the extra SSE stuff, which potentially doubles
or triples the effective performance per MHz), so the ratio is now
1000/(2800*8) = 0.0446 - the problem should thus be immediately obvious.
> We have compiled AMBER10 on the machines & platforms which are the same
> as those described by Ross Walker in Amber 10 Benchmarks. (Dual XEON E5430
> on SuperMicro X7DWA-N)
> We use mpich2-1.0.8 & ifort9.1 to build sander.MPI, the benchmark of
> original JAC by sander.mpi seems fine
> (2cpu: 161sec, 4cpu: 88sec, 8cpu: 54sec),
As Bob said, the benchmarks I showed were for PMEMD, which is designed to
significantly outperform sander. It supports a subset of the methods
(essentially PME and GB MD), but if the calculations you want to run fall
within this feature set you will get better performance using PMEMD here. As
you observe, though, within a single machine sander does at least scale to
all 8 CPUs - although, as usual with these multicore machines, they are
woefully underspecced on memory bandwidth, so the scaling dies off once you
try to use all the cores in a node.
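For comparison, a single-node PMEMD run looks much like a sander one;
something along these lines (in Amber 10 the parallel executable is simply
called pmemd, and the input/output file names here are only placeholders):

  # run PMEMD on all 8 cores of a single node
  mpirun -np 8 $AMBERHOME/exe/pmemd -O -i mdin -o mdout \
      -p prmtop -c inpcrd -r restrt

If that gives sensible timings for JAC then the MPI installation itself is
fine and the cross-node slowdown really is the interconnect.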
Note I did not give any benchmarks beyond 8 CPUs for these machines on the
website. This is because you can't get any scaling over ethernet. If you
want to run larger jobs you will need to buy an InfiniBand-equipped cluster
or, alternatively, see if there is a supercomputer center at which you can
obtain time.
> (For 16cpu computation, abnormal usage of system CPU (60~70%) was observed
> by top or Ganglia monitoring, while 8cpu computation was fine & system CPU
> < 5% & user CPU > 95%)
This is due to the CPUs either just spinning at barriers waiting for data to
arrive over the ethernet or spending their whole time encoding and decoding
TCP/IP packets.
> Can anyone give me some ideas to solve this problem while running parallel
> sander jobs across nodes?
It cannot be solved - not without a new interconnect, sorry. The laws of
physics are against you here, I am afraid. As I said above, though, you
should be able to run things like TI calculations over 2 nodes and REMD
simulations over all the nodes, as long as you are careful to make sure all
threads for a given 'image' run on the same node (a rough example follows).
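To make that concrete, a two-image TI run over the machine file sketched
above would go through the multisander mechanism, roughly like this (the
input, prmtop and groupfile names are just placeholders):

  # ti.groupfile: one line per image, 8 MPI tasks each
  -O -i ti_v0.in -o ti_v0.out -p prmtop.0 -c inpcrd.0 -r restrt.0
  -O -i ti_v1.in -o ti_v1.out -p prmtop.1 -c inpcrd.1 -r restrt.1

  mpirun -machinefile machinefile -np 16 $AMBERHOME/exe/sander.MPI \
      -ng 2 -groupfile ti.groupfile

The same -ng / -groupfile approach applies to REMD (with -rem set and one
group per replica); the key point is simply that each group's ranks all land
on one node, so only the comparatively light inter-image traffic has to
cross the ethernet.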
All the best
Ross
/\
\/
|\oss Walker
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |
Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
to majordomo.scripps.edu