RE: AMBER: Amber9 with MPICH2 failure at runtime

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 16 Apr 2008 15:05:26 -0700

Hi Sasha
 
This looks perfectly okay to me - read the next part of the tutorial and it will explain why things blow up. The serial run should blow up as well; it will just take longer (in wallclock time) to reach that point.
 
All the best
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk <http://www.rosswalker.co.uk/> | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

 


  _____

From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu] On Behalf Of Sasha Buzko
Sent: Wednesday, April 16, 2008 14:43
To: amber.scripps.edu
Subject: AMBER: Amber9 with MPICH2 failure at runtime


Hi,
I installed and tested MPICH2 on several cluster nodes, and compiled amber9 with MKL support and static linking. make test.parallel went fine, with the exception of a couple of possible failures (I haven't followed up on those yet).
To test further, I used an example from an Amber tutorial (a piece of DNA). When run with serial Amber, everything works and produces the expected output. The parallel version, however, fails even when run on a single node (one entry in the mpd.hosts file). The output is below. I did load the resulting trajectory in Sirius, and it looked fine, except that it's incomplete compared to the serial version's output. Do you have any suggestions as to why this might be happening in the parallel version?
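For reference, the two invocations being compared look roughly like this (a sketch only: the parallel command and paths are taken from the run log below; the serial binary name sander and the serial output file names are assumptions):

```shell
#!/bin/sh
# Sketch of the serial vs. parallel sander runs being compared.
# Input/topology/restart paths are from the run log below; the serial
# binary name and serial output names are illustrative assumptions.
AMBERHOME=${AMBERHOME:-/data/apps/amber}
T=/data/apps/amber/test
ARGS="-O -i $T/polyAT_vac_md1_nocut.in -c $T/polyAT_vac_init_min.rst -p $T/polyAT_vac.prmtop"

# Serial run (completes normally):
echo "$AMBERHOME/exe/sander $ARGS -o $T/polyAT_vac_md1_nocut.out"

# Parallel run on 4 processes (aborts as shown in the log below):
echo "mpiexec -n 4 $AMBERHOME/exe/sander.MPI $ARGS -o $T/polyAT_vac_md1_nocut_mpich2.out"
```

The script only echoes the commands, so it can be inspected without sander installed; drop the echo quoting (or use eval) to actually run them.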

Thank you

Sasha


[sasha.node6 test]$ mpiexec -n 4 $AMBERHOME/exe/sander.MPI -O -i /data/apps/amber/test/polyAT_vac_md1_nocut.in -o /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.out -c /data/apps/amber/test/polyAT_vac_init_min.rst -p /data/apps/amber/test/polyAT_vac.prmtop -r /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.rst -x /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.mdcrd
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
Frac coord min, max: -2.111647559080276E-005 0.999587572668685
The system has extended beyond
     the extent of the virtual box.
Restarting sander will recalculate
    a new virtual box with 30 Angstroms
    extra on each side, if there is a
    restart file for this configuration.
SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
Atom out of bounds. If a restart has been written,
restarting should resolve the error
rank 2 in job 2 node6.abicluster_39939 caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 0 in job 2 node6.abicluster_39939 caused collective abort of all ranks
  exit status of rank 0: killed by signal 9




-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Fri Apr 18 2008 - 21:19:55 PDT