[AMBER] AMBER10 PIMD failure on >=16 cores

From: Andrew Petersen <aapeters.ncsu.edu>
Date: Wed, 27 Jul 2011 17:22:02 -0400

Hello users,
I need help with running sander.MPI (AMBER10) for PIMD calculations on many
cores. The cluster I use consists of blades, each with multiple cores.

A classical mechanics simulation launched with
mpiexec_hydra $AMBERHOME/exe/sander.MPI -O -i min.in ...

runs fine on 32 cores. I set up looping jobs, and they complete properly 100%
of the time.

However, if I add a line to the job file that then runs an additional PIMD
job:
mpiexec_hydra $AMBERHOME/exe/sander.MPI -ng xx -groupfile test2.file
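(For reference, the groupfile just has one line of sander arguments per group,
i.e. per bead; a minimal sketch, with placeholder file names rather than my
actual inputs, is:

-O -i pimd.in -p prmtop -c bead1.crd -o bead1.out -r bead1.rst
-O -i pimd.in -p prmtop -c bead2.crd -o bead2.out -r bead2.rst
[... one such line for each of the xx beads ...]

where -i, -p, -c, -o, and -r are the usual sander input, topology, coordinate,
output, and restart flags.)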

The job works with 4 or 8 cores, and sometimes with 16 cores (xx = 4, 8, 16).
It almost always fails at 32 cores (Trotter number = 32). At one time I
suspected the problem was that 16- or 32-bead jobs sometimes had to communicate
across blades, but the evidence does not support that idea, since many 16-core
jobs fail even when all 16 cores are on the same blade. In fact, looking at the
error files from the looping jobs, the same job will work on a specific blade
and then fail later on that same blade. The 16-core looping jobs gave a success
rate of about 36% (10 of 28 jobs).

I need to be able to narrow down where the problem is. It could be:
1) a problem with sander.MPI when it is doing PIMD;
2) a problem with the hardware/MPI of my resource that shows up with PIMD
sander.MPI but not with classical sander.MPI; or
3) a problem with my input data that makes the PIMD code unstable with larger
numbers of beads.

Anyone have any idea where the problem is?

Could someone please run a 32- or 64-core PIMD job with the attached inputs (a
quick 10-step job) and tell me whether it runs reliably? (The input data was
taken from the installation ~/test/ directory and modified for use with 32
cores.) This would help me narrow down the problem.
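In case it is useful, a quick way to generate such a 32-line groupfile from a
single set of inputs and launch it is sketched below (the file names are
placeholders that would need to be matched to the attached inputs, and the
number of MPI processes should be a multiple of the number given to -ng):

# build one sander command line per bead, then launch (names are placeholders)
nbeads=32
rm -f pimd_32.grp
for i in `seq 1 $nbeads`; do
  echo "-O -i pimd.in -p prmtop -c bead$i.crd -o bead$i.out -r bead$i.rst" >> pimd_32.grp
done
mpiexec_hydra $AMBERHOME/exe/sander.MPI -ng $nbeads -groupfile pimd_32.grp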

The Intel compiler was used, with MKL library version 9.1.023.

Thank you
Andrew


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
