RE: AMBER: MPI is slower than single processor with water from Ross Walker on 2007-01-16 (Amber Archive Jan 2007)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 16 Jan 2007 08:50:36 -0800

Dear Mike,

> Protein MD and minimization calculations run much faster
> with MPI than with a single processor, as expected,
> EXCEPT when I include water (solvateoct with TIP3PBOX).
> In this case, the calculations actually run a little
> slower compared to using a single processor.
>
> Is this a general problem when performing calculations
> with many molecules, or have I misset (or have not used) a
> necessary flag?

This is a complicated issue that very much depends on your system setup and
in particular the interconnect between your nodes. First of all if you are
running GB or periodic PME calculations then you should compile and use
PMEMD v9 which ships with Amber 9. This has been specifically optimized by
Bob Duke to run efficiently in parallel.

Secondly we really need to know more about your system. Not getting any
speedup over a single processor is unusual. At the very least if you have a
dual processor machine and you run a 2 processor job you should see about a 1.9x
speedup. The issue comes when communication must move out of the local node.
If you have a very poor switch (such as a cheap 10GB switch which doesn't
have a true non-blocking back plane) or even worse a hub at say only 100Mbps
speed then you will be lucky, on modern cpus, to see any speedup. This is
especially true if you also use that same network for NFS traffic or other
users of your cluster share it. If your cluster is bigger than a single
switch such that you have multiple switches chained together and your
queuing system doesn't ensure switch locality for your jobs then you might
as well forget it. That said with a gigabit ethernet switch and single
processor nodes you should at least be able to get to about 4 cpus or so.
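
To put numbers on "speedup", it helps to time the same job on 1 cpu and on
N cpus and compare. The following is only a minimal sketch of that
arithmetic; the function names and the timings in the usage lines are
hypothetical and are not taken from any Amber output.

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run relative to the single-cpu run (wall-clock times)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: 1.0 means perfect scaling, values below 1.0 mean losses."""
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical timings in seconds, for illustration only.
t1 = 190.0   # assumed wall-clock time on 1 cpu
t2 = 100.0   # assumed wall-clock time on 2 cpus
print(speedup(t1, t2))        # ratio of serial to parallel time
print(efficiency(t1, t2, 2))  # speedup divided by cpu count
```

A speedup below 1.0, as in the case described in the question, means the
parallel run is spending more time communicating than it saves by splitting
the work.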

Ideally when building a cluster for running tightly integrated parallel
calculations such as molecular dynamics you need a decent communication
network such as myrinet or ideally infiniband. It also helps if you have all
file system and management traffic routed over a different network to the
mpi traffic.

Note the problem is also a function of the speed of the individual cpus.
Back in the day of 1GHz pentium 3's with a gigabit interconnect you could
easily scale to between 8 and 16 cpus (depending on system size - typically
the more atoms in your system the better scaling you get). This is because
the ratio of compute speed to the interconnect speed was quite good. Now we
have 3.8GHz machines with SSE2 that can do multiple floating point
operations per clock cycle but people are still hooking them up with gigabit
ethernet. Here the cpu is around 10 to 12 times quicker or more but the
communication speed has remained the same - not good... Even worse they have
4 cpus per box and only one gigabit ethernet card which is like having only
a 250Mbps interconnect per cpu even before you take into account the extra overhead
of packet collisions etc.
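
The arithmetic in the paragraph above can be sketched roughly as follows.
This is an illustration only: the function names are made up for this
example, an even split of the link between cpus is an assumption, and real
throughput will be lower once protocol overhead and contention are included.

```python
def per_cpu_bandwidth_mbps(link_mbps, cpus_per_node):
    """Nominal share of one network link per cpu, assuming an even split."""
    return link_mbps / cpus_per_node


def relative_balance(cpu_speedup, link_speedup=1.0):
    """Change in the communication-to-compute balance (1.0 = unchanged).

    cpu_speedup:  how many times faster the cpu is than the old reference cpu
    link_speedup: how many times faster the interconnect is (1.0 = same link)
    """
    return link_speedup / cpu_speedup


# Four cpus sharing a single gigabit (1000 Mbps) ethernet card.
print(per_cpu_bandwidth_mbps(1000, 4))

# Cpu roughly 10x quicker while the interconnect is unchanged.
print(relative_balance(10))
```

A balance well below 1.0 means each cpu finishes its share of the work much
sooner relative to the time spent waiting on the network, which is why
scaling on the same gigabit hardware gets worse as the cpus get faster.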

So, hopefully the above helps a bit. It would be useful to know what the
exact specs of your system are as well as the specs of the actual
calculation you are trying to run and then we might be able to help out some
more.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

Received on Wed Jan 17 2007 - 06:08:24 PST