[AMBER] Fwd: pmemd.MPI fails to run

From: Fabian Glaser <fabian.glaser.gmail.com>
Date: Mon, 29 Dec 2014 15:28:09 +0200

Hi again,

We tried the --hostfile idea, but it didn't work. See the error below; have you seen something like this before?

The cluster experts are trying to compile with Intel MPI, but they are also having problems with the parallel build. Please see their email below and, if you can, send us suggestions on how to compile properly with Intel MPI.

Thanks!!!


1) --hostfile error we got:
=====

I used this running command within the PBS file:

mpirun -np 24 --hostfile $PBS_NODEFILE pmemd.MPI -O -i min.in -o min.out -p ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd -r min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd

and the output is

-bash-4.1$ more ETA_1_min5.e1076441
bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 30302) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
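
If I read this right, a non-interactive shell on the remote node cannot find orted (the Open MPI launch daemon) or its libraries, so the Open MPI paths probably need to be forwarded explicitly. A possible workaround we have not tested yet (a sketch only; it assumes the Open MPI install at /usr/lib64/openmpi from our PBS script, and n032 as an example compute node):

# quick check from the head node: does a non-interactive shell on a compute node see orted?
ssh n032 'which orted; echo $LD_LIBRARY_PATH'

# untested sketch: point mpirun at the Open MPI install on the remote nodes (--prefix)
# and export the environment variables to the launched processes (-x)
mpirun --prefix /usr/lib64/openmpi \
       -x PATH -x LD_LIBRARY_PATH \
       -np 24 --hostfile $PBS_NODEFILE \
       pmemd.MPI -O -i min.in -o min.out \
       -p ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd \
       -r min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd

Here --prefix tells mpirun where Open MPI lives on the remote nodes, and -x exports the named environment variables to them; both are standard Open MPI mpirun options.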


2) Below are the errors from their compilation attempt:

Thanks!!
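
For the Intel MPI build itself, the sequence I would expect (a sketch only; the exact locations of the Intel environment scripts on tamnun are guesses based on the paths in the error log) is to load both the Intel compiler and Intel MPI environments before configuring, so that the mpiifort/mpiicpc wrappers pick up the Intel compilers:

# sketch only -- the locations of compilervars.sh and mpivars.sh are assumptions
source /usr/local/intel14/bin/compilervars.sh intel64
source /usr/local/intel14/impi/4.1.3.048/intel64/bin/mpivars.sh
export AMBERHOME=/usr/local/amber14_intel
cd $AMBERHOME
./configure -intelmpi intel
make install

Running 'which mpiifort' and 'mpiifort -show' afterwards should confirm which compilers and MPI libraries the wrappers use; the undefined MPIR_* references in the log below look like the kind of mismatch that can come from building with wrappers that were not set up against the same Intel MPI installation.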


_______________________________
Fabian Glaser, PhD

Technion - Israel Institute of Technology
Haifa 32000, ISRAEL

fglaser.technion.ac.il
Tel: +972 4 8293701
Fax: +972 4 8225153

> Begin forwarded message:
>
> Date: December 29, 2014 at 2:30:16 PM GMT+2
> From: HPC <hpc.technion.ac.il>
> To: Fabian Glaser <fabian.glaser.GMAIL.COM>
> Cc: HPC-SUPPORT-L.LISTSERV.TECHNION.AC.IL
> Subject: Re: Fwd: [AMBER] pmemd.MPI fails to run
>
> Fabian,
>
> I managed to run the configuration properly using the Intel compilers, after applying the settings recommended on the Intel site.
> The serial build completed properly this time, but the parallel part of the build failed.
> After doing :
>
> ./configure -intelmpi intel
> make install
>
> The compilation ended with the error message below.
> Please send it to the support person.
>
>
>
> mpiifort -DBINTRAJ -DEMIL -DMPI -c -FR -ip -O3 -I/usr/local/amber14_intel/include -I/usr/local/amber14_intel/include -o pub_fft.o pub_fft.F90
> mpiicpc -shared-intel -o cpptraj ActionFrameCounter.o ActionList.o Action_Angle.o Action_AtomMap.o Action_AtomicCorr.o Action_AtomicFluct.o Action_AutoImage.o Action_Average.o Action_Bounds.o Action_Box.o Action_Center.o Action_CheckStructure.o Action_Closest.o Action_ClusterDihedral.o Action_Contacts.o Action_CreateCrd.o Action_CreateReservoir.o Action_DNAionTracker.o Action_DSSP.o Action_Density.o Action_Diffusion.o Action_Dihedral.o Action_DihedralScan.o Action_Dipole.o Action_DistRmsd.o Action_Distance.o Action_FilterByData.o Action_FixAtomOrder.o Action_Gist.o Action_Grid.o Action_GridFreeEnergy.o Action_Hbond.o Action_Image.o Action_Jcoupling.o Action_LESsplit.o Action_LIE.o Action_MakeStructure.o Action_Mask.o Action_Matrix.o Action_Molsurf.o Action_MultiDihedral.o Action_NAstruct.o Action_NativeContacts.o Action_NMRrst.o Action_OrderParameter.o Action_Outtraj.o Action_PairDist.o Action_Pairwise.o Action_Principal.o Action_Projection.o Action_Pucker.o Action_Radgyr.o Action_Radial.o Action_RandomizeIons.o Action_Rmsd.o Action_Rotate.o Action_Rotdif.o Action_RunningAvg.o Action_STFC_Diffusion.o Action_Scale.o Action_SetVelocity.o Action_Spam.o Action_Strip.o Action_Surf.o Action_SymmetricRmsd.o Action_Temperature.o Action_Translate.o Action_Unwrap.o Action_Vector.o Action_VelocityAutoCorr.o Action_Volmap.o Action_Watershell.o AnalysisList.o Analysis_AmdBias.o Analysis_AutoCorr.o Analysis_Average.o Analysis_Clustering.o Analysis_Corr.o Analysis_CrankShaft.o Analysis_CrdFluct.o Analysis_CrossCorr.o Analysis_Divergence.o Analysis_FFT.o Analysis_Hist.o Analysis_Integrate.o Analysis_IRED.o Analysis_KDE.o Analysis_Lifetime.o Analysis_Matrix.o Analysis_MeltCurve.o Analysis_Modes.o Analysis_MultiHist.o Analysis_Overlap.o Analysis_Regression.o Analysis_RemLog.o Analysis_Rms2d.o Analysis_RmsAvgCorr.o Analysis_RunningAvg.o Analysis_Spline.o Analysis_Statistics.o Analysis_Timecorr.o Analysis_VectorMath.o ArgList.o Array1D.o Atom.o AtomMap.o AtomMask.o AxisType.o Box.o BufferedFrame.o BufferedLine.o ByteRoutines.o CIFfile.o ClusterDist.o ClusterList.o ClusterMatrix.o ClusterNode.o ClusterSieve.o Cluster_DBSCAN.o Cluster_HierAgglo.o Command.o ComplexArray.o Corr.o Cpptraj.o CpptrajFile.o CpptrajState.o CpptrajStdio.o DataFile.o DataFileList.o DataIO.o DataIO_Evecs.o DataIO_Gnuplot.o DataIO_Grace.o DataIO_OpenDx.o DataIO_Mdout.o DataIO_RemLog.o DataIO_Std.o DataIO_Xplor.o DataSet.o DataSetList.o DataSet_1D.o DataSet_2D.o DataSet_3D.o DataSet_Coords.o DataSet_Coords_CRD.o DataSet_Coords_TRJ.o DataSet_GridFlt.o DataSet_MatrixDbl.o DataSet_MatrixFlt.o DataSet_Mesh.o DataSet_Modes.o DataSet_RemLog.o DataSet_Vector.o DataSet_double.o DataSet_float.o DataSet_integer.o DataSet_string.o DihedralSearch.o Dimension.o DistRoutines.o FileIO_Bzip2.o FileIO_Gzip.o FileIO_Mpi.o FileIO_Std.o FileName.o FileTypes.o Frame.o FrameList.o GridAction.o Hungarian.o ImageRoutines.o MapAtom.o MaskToken.o Matrix_3x3.o Mol2File.o NameType.o NetcdfFile.o PDBfile.o ParmFile.o Parm_Amber.o Parm_CharmmPsf.o Parm_CIF.o Parm_Mol2.o Parm_PDB.o Parm_SDF.o ProgressBar.o PubFFT.o Random.o Range.o ReadLine.o ReferenceAction.o ReferenceFrame.o SDFfile.o StringRoutines.o SymmetricRmsdCalc.o Timer.o Topology.o TopologyList.o TorsionRoutines.o Traj_AmberCoord.o Traj_AmberNetcdf.o Traj_AmberRestart.o Traj_AmberRestartNC.o Traj_Binpos.o Traj_CharmmDcd.o Traj_CIF.o Traj_Conflib.o Traj_GmxTrX.o Traj_Mol2File.o Traj_PDBfile.o Traj_SDF.o Traj_SQM.o TrajectoryFile.o Trajin.o TrajinList.o Trajin_Multi.o Trajin_Single.o Trajout.o TrajoutList.o 
> Vec3.o main.o MpiRoutines.o molsurf.o pub_fft.o \
> -L/usr/local/amber14_intel/lib /usr/local/amber14_intel/lib/libnetcdf.a -lz -lbz2 -larpack -llapack -lblas -lifport -lifcore -lsvml readline/libreadline.a
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_CommGetAttr_fort'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_TypeGetAttr'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Err_create_code'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined reference to `MPIR_Keyval_set_proxy'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_CommSetAttr'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Datatype_direct'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIU_Handle_get_ptr_indirect'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined reference to `MPIR_Errhandler_set_cxx'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_WinSetAttr'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Err_return_comm'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Grequest_set_lang_f77'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_TypeSetAttr'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Datatype_builtin'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_b_use_gettimeofday'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `i_malloc'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Type_free_impl'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_WinGetAttr'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Add_finalize'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Wtick'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Type_contiguous'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Wtime_todouble'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Datatype_mem'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Type_commit'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPID_Datatype_set_contents'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined reference to `MPIR_Op_set_cxx'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `i_free'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Process'
> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined reference to `MPIR_Err_preOrPostInit'
> make[3]: *** [cpptraj] Error 1
> make[3]: Leaving directory `/usr/local/amber14_intel/AmberTools/src/cpptraj/src'
> make[2]: *** [install_mpi] Error 2
> make[2]: Leaving directory `/usr/local/amber14_intel/AmberTools/src/cpptraj'
> make[1]: *** [parallel] Error 2
> make[1]: Leaving directory `/usr/local/amber14_intel/AmberTools/src'
> make: *** [install] Error 2
>
>
>
> HPC Support
> CIS - Technion
> On 12/29/2014 02:05 PM, Fabian Glaser wrote:
>> Here is the answer below my email. The problem is not the Amber software; it is the MPI implementation, or the way it is used on the tamnun system. See the --hostfile suggestion: somehow we have to tell PBS where to run the different threads, and it seems the default is to run them all on the same node, which is probably why you see the same problem with other programs. See the answer below mine.
>>
>> These people are very good technically; I am sure they know what they are saying, so I would try to follow their instructions. The problem is not in Amber.
>>
>> Please give this problem some priority; I cannot do anything without this software.
>>
>> Thanks,
>>
>> Fabian
>>
>>
>>
>>
>>> Begin forwarded message:
>>>
>>> Date: December 29, 2014 at 1:53:18 PM GMT+2
>>> From: Jason Swails <jason.swails.gmail.com>
>>> To: AMBER Mailing List <amber.ambermd.org>
>>> Cc: HPC <hpc.technion.ac.il>
>>> Subject: Re: [AMBER] pmemd.MPI fails to run
>>> Reply-To: AMBER Mailing List <amber.ambermd.org>
>>>
>>> On Mon, Dec 29, 2014 at 5:33 AM, Fabian Glaser <fabian.glaser.gmail.com> wrote:
>>>
>>>> Thanks Bill,
>>>>
>>>> We use PBS queuing, and although all the job's internal variables look
>>>> fine (24 CPUs from two different nodes), the same problem occurs: the job
>>>> runs on only one node with 24 subjobs (2 jobs per CPU) while the second
>>>> node stays empty, instead of starting 24 processes across two nodes. Below
>>>> are the output variables once the job ends, with their actual values as
>>>> produced by PBS, followed by the PBS script I use. The Amber output files,
>>>> by the way, are produced just fine; the problem is that the job does not
>>>> spread across more than one node.
>>>>
>>>> So it seems the PBS queue is working correctly, but something is preventing
>>>> it from using two nodes. Do you still think the problem is in the system,
>>>> or do you think we should recompile?
>>>>
>>>> Thanks a lot,
>>>>
>>>> Fabian
>>>>
>>>>
>>>> • PBS_O_HOST - name of the host upon which qsub command is running
>>>> • PBS_O_QUEUE - name of the original queue to which the job was
>>>> submitted
>>>> • PBS_O_WORKDIR - absolute path of the current working directory
>>>> of the qsub command
>>>> • PBS_ENVIRONMENT - set to PBS_BATCH to indicate the job is a
>>>> batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive
>>>> job
>>>> • PBS_JOBID - the job identifier assigned to the job by the batch
>>>> system
>>>> • PBS_JOBNAME - the job name supplied by the user
>>>> • PBS_NODEFILE - the name of the file containing the list of nodes
>>>> assigned to the job
>>>> • PBS_QUEUE - the name of the queue from which the job is executed
>>>>
>>>> PBS output:
>>>> ===
>>>> -bash-4.1$ more ETA_1_min3.o1076308
>>>> /u/fglaser/projects/IsraelVlodavsky/hep1_v2/MD/ETA_1/min
>>>> tamnun.default.domain
>>>> all_l_p
>>>> /u/fglaser/projects/IsraelVlodavsky/hep1_v2/MD/ETA_1/min
>>>> PBS_BATCH
>>>> 1076308.tamnun
>>>> ETA_1_min3
>>>> all_l_p_exe
>>>> nodes (24 cpu total):
>>>> n032.default.domain
>>>> n034.default.domain
>>>>
>>>> PBS file
>>>> ======
>>>>
>>>> #!/bin/sh
>>>> #
>>>> # job name (default is the name of pbs script file)
>>>> #---------------------------------------------------
>>>> #PBS -N ETA_1_min3
>>>> # Submit the job to the queue "queue_name"
>>>> #---------------------------------------------------
>>>> #PBS -q all_l_p
>>>> # Send the mail messages (see below) to the specified user address
>>>> #-----------------------------------------------------------------
>>>> #PBS -M fglaser.technion.ac.il
>>>> # send me mail when the job begins
>>>> #---------------------------------------------------
>>>> #PBS -mbea
>>>> # resource limits: number and distribution of parallel processes
>>>> #------------------------------------------------------------------
>>>> #PBS -l select=2:ncpus=12:mpiprocs=12
>>>> #
>>>> # comment: this select statement means: use M chunks (nodes),
>>>> # use N (<= 12) CPUs for N mpi tasks on each of M nodes.
>>>> # "scatter" will use exactly N CPUs from each node, while omitting
>>>> # "-l place" statement will fill all available CPUs of M nodes
>>>> #
>>>> # specifying working directory
>>>> #------------------------------------------------------
>>>> echo $PBS_O_WORKDIR
>>>> echo $PBS_O_HOST
>>>> echo $PBS_O_QUEUE
>>>> echo $PBS_O_WORKDIR
>>>> echo $PBS_ENVIRONMENT
>>>> echo $PBS_JOBID
>>>> echo $PBS_JOBNAME
>>>> echo $PBS_QUEUE
>>>>
>>>>
>>>> cd $PBS_O_WORKDIR
>>>>
>>>>
>>>> # This finds out the number of MPI processes (CPU slots) we have
>>>> NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
>>>> echo "nodes ($NP cpu total):"
>>>> sort $PBS_NODEFILE | uniq
>>>>
>>>>
>>>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>>>> export PATH=/usr/lib64/openmpi/bin:$PATH
>>>> source /usr/local/amber14/amber.sh
>>>>
>>>> #source /usr/local/amber14/setup.sh
>>>>
>>>> # running MPI executable with M*N processes
>>>> #------------------------------------------------------
>>>>
>>>> mpirun -np 24 pmemd.MPI -O -i min.in -o min.out -p
>>>> ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd -r
>>>> min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd
>>>>
>>> A number of people have already pointed out what they think is happening
>>> (and I agree with them): you are not giving any instruction here to tell
>>> the MPI implementation WHERE to actually run those 24 threads. In some
>>> cases (depending on how your MPI is installed), this will mean that all 24
>>> threads are run on the same node. If this is happening, you need to
>>> provide a machinefile to mpirun to tell it exactly where to start all of
>>> the threads. In the case of OpenMPI, this can be done with --hostfile; so
>>> something like this:
>>>
>>> mpirun -np 24 --hostfile $PBS_NODEFILE pmemd.MPI -O ...
>>>
>>> That said, the most common PBS implementation (Torque) provides an API so
>>> that applications can be made aware of the scheduler. The various MPI
>>> implementations (OpenMPI, MPICH, etc.) can all be built with Torque
>>> integration, which will make it *much* easier to use within the PBS
>>> environment. For example, on one of the HPC systems I've used in the past,
>>> I was able to use the command:
>>>
>>> mpiexec pmemd.MPI -O ...
>>>
>>> with no arguments at all to mpiexec/mpirun -- in this case, mpiexec was
>>> able to figure out how many threads to run and where to run them because it
>>> was integrated directly with the scheduler.
>>>
>>> HTH,
>>> Jason
>>>
>>> --
>>> Jason M. Swails
>>> BioMaPS,
>>> Rutgers University
>>> Postdoctoral Researcher
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Dec 29 2014 - 05:30:03 PST