RE: AMBER: Amber9 with MPICH2 failure at runtime

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 16 Apr 2008 16:47:03 -0700

Hi Sasha,
 
You could try PMEMD; it can help a bit with gigabit ethernet, but it will really depend on the size of the system you are running. The bigger the simulation, the better it is likely to scale. Note that PMEMD works quite well with a crossover cable - generally it is cheap gigabit switches that kill you.
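
If you want to try it, pmemd takes the same style of command line as sander.MPI, so switching is mostly a matter of pointing your MPI launcher at the other executable. A sketch, assuming a parallel pmemd build and mpiexec as the launcher (the file names here are placeholders):

    # launch pmemd exactly as you would sander.MPI (file names are placeholders)
    mpiexec -n 4 $AMBERHOME/exe/pmemd -O \
        -i md.in -o md.out -p prmtop -c inpcrd -r restrt -x mdcrd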
 
See if the switch supports flow control - if it does, turn it on, since that will help stop packet loss. Otherwise you are really just fighting the laws of physics - especially when you have multiple cores in one box but only one gigabit link per box, and it is even worse if you also use that link for NFS traffic.
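
On the host side you can usually check and set pause-frame settings with ethtool, assuming Linux nodes and a NIC driver that supports it (eth0 is a placeholder for your interface; the switch end has to be enabled through its own management interface):

    ethtool -a eth0                  # show current pause (flow control) settings
    ethtool -A eth0 rx on tx on      # enable receive/transmit flow control (run as root)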
 
Something like Myrinet or InfiniBand is really the only option these days. Or get an xRAC allocation from NSF to use "real" ;-) machines.
 
All the best
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

  _____

From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Sasha Buzko
Sent: Wednesday, April 16, 2008 15:43
To: amber.scripps.edu
Subject: RE: AMBER: Amber9 with MPICH2 failure at runtime


Thanks, Ross.
I guess what raised my suspicions is that the serial version didn't blow up, but completed the trajectory without incident. Anyway, I ran the bounded version with the 12 Angstrom cutoff, and that worked.
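
For reference, the bounded input differs from the nocut one essentially only in the cutoff; something along these lines (the file name, thermostat settings, and step count here are illustrative, not necessarily the tutorial's exact values):

    # sketch of the bounded-cutoff mdin (illustrative values)
    cat > polyAT_vac_md1_12Acut.in <<'EOF'
    polyA-polyT vacuum MD, 12 Angstrom cutoff (sketch)
     &cntrl
       imin = 0, ntb = 0, igb = 0,
       tempi = 300.0, temp0 = 300.0, ntt = 3, gamma_ln = 1.0,
       nstlim = 1000, dt = 0.001,
       ntpr = 100, ntwx = 100,
       cut = 12.0
     /
    EOF
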
I do have a performance-related question, though. Our cluster was originally designed for docking jobs, which split easily and need no fast interconnect between the nodes since there is no cross-talk, so the connections are gigabit ethernet. I noticed that running the same example with sander.MPI on 4 nodes takes about twice the time it did on one. Do you think using PMEMD could alleviate this issue, or is it hopeless to run Amber without InfiniBand interconnects?
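
To put a number on that, I compared the wallclock reported in the TIMINGS section at the end of each mdout, along these lines (the grep pattern is approximate - match whatever your mdout actually prints; file names are placeholders):

    # same job on 1 process and on 4, then compare reported wallclock
    mpiexec -n 1 $AMBERHOME/exe/sander.MPI -O -i md.in -o md_np1.out -p prmtop -c inpcrd
    mpiexec -n 4 $AMBERHOME/exe/sander.MPI -O -i md.in -o md_np4.out -p prmtop -c inpcrd
    grep -i "wall time" md_np1.out md_np4.out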

Thanks

Sasha



On Wed, 2008-04-16 at 15:05 -0700, Ross Walker wrote:

Hi Sasha,

This looks perfectly okay to me - read the next part of the tutorial and it will explain why things blow up. The serial run should blow up as well - it will just take longer (in wallclock time) to reach that point.

All the best

Ross


/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

  _____


From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Sasha Buzko
Sent: Wednesday, April 16, 2008 14:43
To: amber.scripps.edu
Subject: AMBER: Amber9 with MPICH2 failure at runtime

Hi,
I installed and tested MPICH2 on several cluster nodes, and compiled Amber9 with MKL support and static linking. make test.parallel went fine, with the exception of a couple of possible failures (I haven't followed up on those yet).
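
For what it's worth, the parallel test suite takes its launch command from the DO_PARALLEL environment variable, so the test runs looked something like this (a process count of 2 is just an example):

    export DO_PARALLEL="mpiexec -n 2"
    cd $AMBERHOME/test
    make test.parallel
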
To test further, I used an example from an Amber tutorial (a piece of DNA). When run with serial Amber, everything works fine and produces the expected output. The parallel version, however, fails even when run on a single node (one entry in the mpd.hosts file). The output is below. I did play the resulting trajectory in Sirius, and it looked fine, except that it is incomplete compared with the serial version's output. Do you have any suggestions as to why this might be happening in the parallel version?

Thank you

Sasha


[sasha.node6 test]$ mpiexec -n 4 $AMBERHOME/exe/sander.MPI -O -i /data/apps/amber/test/polyAT_vac_md1_nocut.in -o /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.out -c /data/apps/amber/test/polyAT_vac_init_min.rst -p /data/apps/amber/test/polyAT_vac.prmtop -r /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.rst -x /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.mdcrd
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
rank 2 in job 2 node6.abicluster_39939 caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 0 in job 2 node6.abicluster_39939 caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Fri Apr 18 2008 - 21:19:56 PDT