RE: AMBER: Amber9 with MPICH2 failure at runtime

From: Sasha Buzko <obuzko.ucla.edu>
Date: Wed, 16 Apr 2008 15:43:12 -0700

Thanks, Ross.
I guess what caused my suspicions is that the serial version didn't blow
up but completed the trajectory without incident. Anyway, I ran the 12A
cutoff bounded version, and that worked.
I do have a performance-related question, though. Our cluster was
initially designed for docking jobs that are easily splittable, so the
nodes don't need fast interconnects since there is no cross-talk. The
connections are gigabit ethernet. I noticed that running the same
example using sander.MPI on 4 nodes takes about twice the time it did on
one. Do you think using PMEMD could alleviate this issue, or is it
hopeless to run Amber without InfiniBand interconnects?
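For what it's worth, the slowdown above can be expressed as parallel speedup and efficiency; a quick sketch (the timings below are hypothetical placeholders, not measured values from this cluster):

```python
# Parallel speedup and efficiency from wallclock times.
# Timings are hypothetical placeholders; substitute measured values.

def speedup(t_serial, t_parallel):
    """Speedup = serial wallclock time / parallel wallclock time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency = speedup / process count (1.0 is ideal scaling)."""
    return speedup(t_serial, t_parallel) / n_procs

# Example matching the symptom described: a run that takes twice
# as long on 4 nodes as on 1.
t1, t4 = 100.0, 200.0
print(speedup(t1, t4))        # 0.5  -- slower than serial
print(efficiency(t1, t4, 4))  # 0.125 -- heavily communication-bound
```

An efficiency that far below 1.0 usually points to the interconnect rather than the compute.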

Thanks

Sasha



On Wed, 2008-04-16 at 15:05 -0700, Ross Walker wrote:

> 
>
> Hi Sasha
>
> This looks perfectly okay to me - read the next part of the tutorial
> and it will explain why things blow up. The serial one should blow up
> as well - it will just take longer (in wallclock time) to reach that
> point.
>
> All the best
> Ross
>
>
> /\
> \/
> |\oss Walker
>
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may
> not be read every day, and should not be used for urgent or sensitive
> issues.
>
>
>
>
>
>
>
> ______________________________________________________________
> From: owner-amber.scripps.edu [mailto:owner-amber.scripps.edu]
> On Behalf Of Sasha Buzko
> Sent: Wednesday, April 16, 2008 14:43
> To: amber.scripps.edu
> Subject: AMBER: Amber9 with MPICH2 failure at runtime
>
>
>
>
> Hi,
> I installed and tested MPICH2 on several cluster nodes, as
> well as compiled amber9 with MKL support and static linking.
> make test.parallel went fine, with the exception of a couple
> of possible failures (didn't follow up on those yet).
> To test further, I used an example from an Amber tutorial
> (piece of DNA). When run as a serial Amber, all works fine and
> produces expected output. The parallel version, however, fails
> even when run on a single node (one entry in the mpd.hosts
> file). The output is below. I did run the resulting trajectory
> using Sirius, and it looked fine, except that it's incomplete,
> as opposed to the serial version output. Do you have any
> suggestions as to why this might be happening in the parallel
> version?
>
> Thank you
>
> Sasha
>
>
> [sasha.node6 test]$ mpiexec -n 4 $AMBERHOME/exe/sander.MPI -O
> -i /data/apps/amber/test/polyAT_vac_md1_nocut.in
> -o /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.out
> -c /data/apps/amber/test/polyAT_vac_init_min.rst
> -p /data/apps/amber/test/polyAT_vac.prmtop
> -r /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.rst
> -x /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.mdcrd
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> [cli_1]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> [cli_2]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
> [cli_3]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>
> (the following message was printed by each aborting rank, interleaved:)
>
> Frac coord min, max: -2.111647559080276E-005 0.999587572668685
> The system has extended beyond
> the extent of the virtual box.
> Restarting sander will recalculate
> a new virtual box with 30 Angstroms
> extra on each side, if there is a
> restart file for this configuration.
> SANDER BOMB in subroutine Routine: map_coords (ew_force.f)
> Atom out of bounds. If a restart has been written,
> restarting should resolve the error
>
> rank 2 in job 2 node6.abicluster_39939 caused collective abort of all ranks
> exit status of rank 2: return code 1
> rank 0 in job 2 node6.abicluster_39939 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
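The SANDER BOMB in the log fires when an atom's fractional coordinate falls outside the virtual box. A minimal sketch of that kind of bounds check (an illustration only, not the actual map_coords logic from ew_force.f):

```python
# Illustration of a fractional-coordinate bounds check of the sort
# that triggers the "Atom out of bounds" SANDER BOMB. This mimics,
# rather than reproduces, the map_coords logic in ew_force.f.

def frac_coords_in_bounds(frac_coords):
    """Return True if all fractional coordinates lie in [0, 1)."""
    lo, hi = min(frac_coords), max(frac_coords)
    print(f"Frac coord min, max: {lo} {hi}")
    return lo >= 0.0 and hi < 1.0

# Values from the log: the slightly negative minimum is out of bounds.
ok = frac_coords_in_bounds([-2.111647559080276e-05, 0.5, 0.999587572668685])
print(ok)  # False
```

In the actual run, restarting from a written restart file lets sander rebuild the virtual box around the new configuration, which is why the error message suggests exactly that.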

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Fri Apr 18 2008 - 21:19:55 PDT