Dear Jason,
Thanks for the reply!
I probably should have elaborated a little more: I did also try using
$PBS_NODEFILE, i.e. with
DO_PARALLEL="$AMBERHOME/bin/mpirun -np 2 --hostfile \$PBS_NODEFILE"
When I do 'make test' with this, I get the following:
Open RTE was unable to open the hostfile:
$PBS_NODEFILE
Check to make sure the path and filename are correct.
And, as I wrote in the original post, a job-submission file with a single
sander.MPI job in it did run and complete as normal, both with and without
specifying the hostfile.
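(Side note, in case it helps pin this down: with the backslash, the shell never expands the variable, so mpirun is handed the literal string '$PBS_NODEFILE', which is exactly what the error above shows. A minimal bash illustration, with a made-up path standing in for the real node file:)

```shell
# With the backslash, the variable is NOT expanded; mpirun would
# receive the literal string '$PBS_NODEFILE'.
escaped="mpirun -np 2 --hostfile \$PBS_NODEFILE"
echo "$escaped"    # -> mpirun -np 2 --hostfile $PBS_NODEFILE

# Without the backslash, the shell substitutes the path that PBS
# puts in the environment (hypothetical path for illustration):
PBS_NODEFILE=/var/spool/pbs/aux/12345.curie
unescaped="mpirun -np 2 --hostfile $PBS_NODEFILE"
echo "$unescaped"  # -> mpirun -np 2 --hostfile /var/spool/pbs/aux/12345.curie
```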
However, when I use the equivalent job submission file to run the
cytosine-test:
#!/bin/bash -f
#
#PBS -l nodes=1:ppn=4
#PBS -q veryshort
#PBS -N test
#PBS -j oe
#
export AMBERHOME="/users/chmwvdk/amber12"
cd $AMBERHOME/test/cytosine/
export MYEXE="$AMBERHOME/bin/sander.MPI"
mpirun -np 4 $MYEXE -O -i in.md -c crd.md.23 -o cytosine.out
I get the same behaviour as with 'make test', i.e. the job 'stalls' after
section 4: no TIMINGS are printed. The same is true when I include
"--hostfile $PBS_NODEFILE" in the above job submission file. (BTW, I do get
the "Final Performance Info" in mdinfo.)
What could be causing this?
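In case it is useful, this is how I would wire the node file into the submission script so that -np always matches the PBS allocation ($PBS_NODEFILE lists one hostname per allocated core, so its line count gives the process count). A sketch, assuming bash; the mktemp file below is only a stand-in for the node file PBS creates inside a real job:

```shell
# Mock nodefile standing in for the one PBS provides (nodes=1:ppn=4
# would yield four lines with the same hostname):
PBS_NODEFILE=$(mktemp)
printf 'node1\nnode1\nnode1\nnode1\n' > "$PBS_NODEFILE"

# Derive the process count from the nodefile's line count:
NP=$(( $(wc -l < "$PBS_NODEFILE") ))
export DO_PARALLEL="mpirun -np $NP --hostfile $PBS_NODEFILE"
echo "$DO_PARALLEL"
```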
Thanks again,
Marc
On 16 April 2012 20:05, Jason Swails <jason.swails.gmail.com> wrote:
> On Mon, Apr 16, 2012 at 10:15 AM, Marc van der Kamp <
> marcvanderkamp.gmail.com> wrote:
>
> > Hi,
> >
> > I have a problem running the tests for the parallel executables. (My
> > apologies if this is not an AMBER issue per se)
> >
> > I have:
> > compiled & tested AMBER12 in serial (with success)
> > downloaded and compiled openmpi-1.5.4 (using "./configure_openmpi -gnu"
> > in $AMBERHOME/AmberTools/src/)
> > compiled AMBER12 in parallel (with success, it seems)
> >
> > Now, I'd like to test AMBER12 in parallel.
> > First, I tried simply doing:
> >
> > cd $AMBERHOME
> > make test
> >
> > The first series of tests that run are in AmberTools/test/nab and
> > these tests are fine.
> > Then, when trying to run tests in AmberTools/test/mmpbsa_py I get a bunch
> > of errors like this:
> >
> > make[3]: Entering directory
> > `/export/users/chmwvdk/amber12/AmberTools/test/mmpbsa_py'
> > cd EstRAL_Files && ./Run.makeparms
> > This is not a parallel test.
> > cd 01_Generalized_Born && ./Run.GB
> > [curie.chm.bris.ac.uk:23918] [[32424,1],0] ORTE_ERROR_LOG: Data unpack
> > would read past end of buffer in file util/nidmap.c at line 371
> > [curie.chm.bris.ac.uk:23918] [[32424,1],0] ORTE_ERROR_LOG: Data unpack
> > would read past end of buffer in file base/ess_base_nidmap.c at line 62
> > [curie.chm.bris.ac.uk:23918] [[32424,1],0] ORTE_ERROR_LOG: Data unpack
> > would read past end of buffer in file ess_env_module.c at line 173
> >
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > orte_ess_base_build_nidmap failed
> > --> Returned value Data unpack would read past end of buffer (-26)
> > instead of ORTE_SUCCESS
> >
> > --------------------------------------------------------------------------
> > ...
> > ...
> >
> > This is unfortunate, but not really of great importance to me, as the
> > parallel executables of the AMBER12 programs (not AmberTools12) are what
> > I'm really after.
> > However, when I get to this stage, the process simply 'hangs' after:
> > make[2]: Entering directory `/export/users/chmwvdk/amber12/test'
> > export TESTsander=/users/chmwvdk/amber12/bin/sander.MPI; make -k
> > test.sander.BASIC
> > make[3]: Entering directory `/export/users/chmwvdk/amber12/test'
> > cd cytosine && ./Run.cytosine
> >
> >
> > It turns out that the cytosine test has run fine, writing up to 'section
> > 4' in the output file (values in cytosine/cytosine.out and
> > cytosine/cytosine.out.save are identical), but nothing happens after
> > that.
> > In other words, the process seems to hang on getting the TIMINGS
> > information (section 5) in the output file.
> >
> > I checked that my environment was OK (as far as I can tell), i.e.
> > $PATH contains $AMBERHOME/bin
> > $LD_LIBRARY_PATH contains $AMBERHOME/lib
> > $MPI_HOME is set to $AMBERHOME
> > 'which mpirun' gives $AMBERHOME/bin/mpirun
> > 'which mpicc' gives $AMBERHOME/bin/mpicc
> >
> > This installation is on a cluster, and I'm not sure if it is set up to
> > run in parallel on the headnode, so I also made a PBS submission script
> > (as suggested in http://archive.ambermd.org/200701/0112.html ) and
> > submitted that.
> > This is the submission script:
> >
>
> Ah, sometimes clusters can be trickier.
>
>
> > #!/bin/bash
> > #
> > #PBS -l walltime=5:0:0,nodes=1:ppn=4
> > #PBS -q veryshort
> > #PBS -N parallel_test
> > #PBS -j oe
> >
> > export DO_PARALLEL="mpirun -np 4 "
> >
>
> Try setting DO_PARALLEL to something that uses the PBS_NODEFILE. Maybe
> something like this:
>
> export DO_PARALLEL="mpirun -hostfile $PBS_NODEFILE"
>
> (the -hostfile flag may differ depending on your MPI implementation -- see
> the mpirun/mpiexec man pages).
>
> > As an aside, I notice that the 'header' of output files from AMBER12
> > still reads:
> >
> > -------------------------------------------------------
> > Amber 11 SANDER 2010
> > -------------------------------------------------------
> >
> > That should probably be updated...
> >
>
> It is -- bugfix.2 for Amber 12. You can run:
>
> $AMBERHOME/patch_amber.py --update
>
> to download and apply all patches.
>
> HTH,
> Jason
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
Received on Tue Apr 17 2012 - 02:00:03 PDT