RE: AMBER: Amber9's sander.MPI on x86_64

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 4 Jun 2007 11:20:43 -0700

Dear Karl,

Firstly, the reason you are seeing four sander processes per node is that
this is exactly what you asked for:

> #PBS -l nodes=4:ppn=4

You are asking here for a total of 4 nodes with 4 processors per node,
i.e. 16 MPI tasks in total.
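
For example, a minimal Torque run file that matches this request (the
walltime, job name and input file names below are just placeholders)
might look something like:

    #PBS -l nodes=4:ppn=4        # 4 nodes x 4 processors per node = 16 tasks
    #PBS -l walltime=02:00:00    # placeholder walltime
    #PBS -N sander_test          # placeholder job name

    cd $PBS_O_WORKDIR
    # Torque writes one line per processor slot, so this gives 16 here
    NPROCS=`wc -l < $PBS_NODEFILE`
    mpiexec -machinefile $PBS_NODEFILE -np $NPROCS \
        /usr/local/Dist/amber9/exe/sander.MPI -O -i mdin -o mdout \
        -p prmtop -c inpcrd -r restrt

If you only want 4 MPI tasks you would ask for nodes=1:ppn=4 instead (or
nodes=4:ppn=1 to spread them across boxes).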

Next, when you run sander instead of sander.MPI you do indeed get 16
copies of sander, each running at 100% CPU usage (assuming there are 4
physical CPUs per box). This makes complete sense, since there is no
communication going on: you are running 16 identical single-processor
runs.

In terms of efficiency on 16 processors, the performance you are likely to
see is really a function of how good your cluster is. If it was built on
the cheap with gigabit Ethernet and bargain-basement switches then you
will be lucky to see any performance increase once you go off a single
node, so 4 processors is your practical maximum. With PMEMD you might see
some improvement up to two boxes, but this is likely the limit. To do any
better you really need to invest in a decent interconnect such as
InfiniBand and make sure it is configured properly.

> When we submit a 16 processor job using the command in our Torque
> run file (as shown below): "mpiexec -machinefile $MACHINEFILE -np 16
> /usr/local/Dist/amber9/exe/sander.MPI -O ..." each node shows four
> sander processes at 0 or 0.1% each.

So this is probably correct - it is probably spending its whole time just
waiting for communication. The thing to remember is that there are a huge
number of complexities in running in parallel and there is a reason decent
'supercomputers' cost millions. More CPUs is not always faster - especially
with commodity interconnects and in cases where you share the machine with
others. Secondly, it will depend on the type of job you are running. GB
simulations generally scale better than explicit solvent simulations, and
larger simulations will scale better. So if you are only testing, say, NMA
in the gas phase (12 atoms) then don't expect to see any benefit in moving
to multiple processors. If, however, your system is 100K atoms then you
will generally benefit from running in parallel up to the limitations of
your interconnect, file system, etc.

Performance will also be affected by your file system. If you share the
cluster with a ton of other people and it is just naively set up to do all
file I/O to a single NFS server over the same interconnect as MPI traffic,
then you can kiss any sort of decent performance goodbye. Ideally these
things should have decent SANs attached to all nodes on a separate
interconnect fabric. If you write fairly infrequently, say every 500
steps, and other people aren't hammering the system, then you can get away
with the NFS approach - but don't be surprised if you suddenly find your
calculation slows down as other people load up the file system.
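
As a rough illustration, the &cntrl flags that control how often sander
writes output are ntpr (energies to mdout), ntwx (coordinates to mdcrd)
and ntwr (the restart file). The values below are only an example of
'fairly infrequent' output, not a recommendation for your system, and all
the other &cntrl settings you would need (thermostat, pressure coupling,
SHAKE, etc.) are omitted:

    &cntrl
      imin=0, nstlim=500000, dt=0.002,
      ntpr=500, ntwx=500, ntwr=5000,
    /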

> Furthermore, without using Torque(PBS) and submitting by command
> line "mpiexec -np 16 /usr/local/Dist/amber9/exe/sander.MPI -O -i ..."
> we have 16 sander processes spawned, 4 per node on a total of 4
> nodes. However, each process is running at ~10%, which doesn't seem
> efficient.

If your calculation fits within the remit of what PMEMD supports then you
should be using it for parallel runs; see $AMBERHOME/src/pmemd to compile
it. This will generally give you better performance.
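
Once built (the build lives in $AMBERHOME/src/pmemd - follow the README
there for the configure options that match your compiler and MPI library),
the parallel pmemd binary takes the same command line flags as sander.MPI,
so your run script only needs the executable path changed. A sketch,
assuming the binary ends up in $AMBERHOME/exe/pmemd:

    mpiexec -machinefile $PBS_NODEFILE -np 16 \
        $AMBERHOME/exe/pmemd -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt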
 
> Without Torque(PBS), submitting by command line "mpiexec -np 16
> /usr/local/Dist/amber9/exe/sander -O -i ...", we have 16 sander
> processes spawned, 4 per node on a total of 4 nodes, with each
> process running at 100%. Does this mean we have 16 jobs running in
> serial, overwriting the output 16 times?

Yes, and all outputting to the same files, so everything will be horribly
corrupted.

> Does anybody have any insight into what is going on?
> How do we get
> sander.MPI to run in parallel at maximum CPU efficiency?

Maximum CPU efficiency will always be when you run on 1 CPU. As you
increase the number of CPUs the efficiency will go down, but you typically
get a net gain because the efficiency drops off more slowly than the CPU
count grows - up to a point, after which things start to slow down again.
So I would suggest first seeing if you can use PMEMD, and then running 1,
2, 4, 8 and 16 CPU calculations and checking the timings.
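
A simple way to script that comparison (a sketch - it assumes your input
files are called mdin/prmtop/inpcrd, and for PBS runs you would also
adjust nodes/ppn to match each CPU count):

    for np in 1 2 4 8 16; do
        mpiexec -machinefile $MACHINEFILE -np $np \
            /usr/local/Dist/amber9/exe/sander.MPI -O -i mdin -p prmtop \
            -c inpcrd -o mdout.$np -r restrt.$np -x mdcrd.$np
        tail -40 mdout.$np   # compare the timing summary at the end of each run
    done

Divide the 1 CPU time by the N CPU time to get the speedup; once that
number stops growing you have found the scaling limit of your setup.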

If you are using a gigabit interconnect then you could ask whoever is in
charge of the machine to turn on flow control on the switch and turn off
QoS scheduling - assuming all your hardware supports this. It may help a
little. Alternatively, if your interconnect is not Ethernet then you
likely need to reconfigure MPICH2 to make sure it is actually using the
correct interconnect.
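
Switch-side flow control and QoS settings are vendor specific, so that
part is a conversation with your admin, but on the Linux nodes themselves
you (or root) can at least check whether Ethernet pause-frame flow control
is enabled on the NICs with ethtool. This is just a node-side sketch -
eth0 is an assumption, use whichever interface carries your MPI traffic:

    # show the current pause-frame (flow control) settings
    ethtool -a eth0
    # enable rx/tx flow control if the NIC and driver support it (needs root)
    ethtool -A eth0 rx on tx on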

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu