Re: [AMBER] Fwd: pmemd.MPI fails to run

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 29 Dec 2014 08:05:21 -0800

Hi Fabian,

This problem is completely separate from AMBER here. It is a very basic
misunderstanding of how to run MPI jobs. I would start by reading the
manual for your cluster, the queuing system and the MPI installation you
are using. Then take a simple MPI Hello World example and get that running
before trying something complicated such as the AMBER suite. There should
be examples in the manual for your MPI installation.
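
For example, a minimal test along the lines below (a sketch only; it assumes
the mpicc compiler wrapper and mpirun from the same MPI installation are on
your PATH inside the batch job, and --hostfile is the Open MPI form of the
option) should print one line per rank, with hostnames covering both of the
nodes you requested:

cat > hello_mpi.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    int rank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d running on %s\n", rank, nprocs, host);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello_mpi.c -o hello_mpi
mpirun -np 24 --hostfile $PBS_NODEFILE ./hello_mpi | sort

If all 24 lines report the same hostname, the launcher and queue setup are the
problem, and rebuilding AMBER will not change that.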

Simply trying different MPI installations is not going to work unless you
randomly stumble upon a working configuration. It really needs someone who
understands how clusters are set up and how they work, or you may need to
consider hiring a consultant to set up your cluster and train your team.

All the best
Ross


On 12/29/14, 5:28 AM, "Fabian Glaser" <fabian.glaser.gmail.com> wrote:

>Hi again,
>
>We tried the --hostfile idea but it didn't work; see the error below.
>Have you seen something like this?
>
>The cluster experts are trying to compile with intelmpi but are also having
>problems with the parallel build. Please see their email below and, if you
>can, send us suggestions on how to compile properly with intelmpi.
>
>Thanks!!!
>
>
>1) The --hostfile error we got:
>=====
>
>I used this command inside the PBS file:
>
>mpirun -np 24 --hostfile $PBS_NODEFILE pmemd.MPI -O -i min.in -o min.out
>-p ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd -r
>min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd
>
>and the output is
>
>-bash-4.1$ more ETA_1_min5.e1076441
>bash: orted: command not found
>--------------------------------------------------------------------------
>A daemon (pid 30302) died unexpectedly with status 127 while attempting
>to launch so we are aborting.
>
>There may be more information reported by the environment (see above).
>
>This may be because the daemon was unable to find all the needed shared
>libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>location of the shared libraries on the remote nodes and this will
>automatically be forwarded to the remote nodes.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun noticed that the job aborted, but has no info as to the process
>that caused that situation.
>--------------------------------------------------------------------------
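>
>(The Open MPI help text above suggests the remote node cannot find orted or
>its libraries. Would something like the command below be the right way to
>forward them, assuming the /usr/lib64/openmpi installation that my PBS
>script further down already points PATH and LD_LIBRARY_PATH at?)
>
>mpirun -np 24 --hostfile $PBS_NODEFILE \
>    --prefix /usr/lib64/openmpi \
>    -x PATH -x LD_LIBRARY_PATH \
>    pmemd.MPI -O -i min.in -o min.out \
>    -p ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd \
>    -r min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd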
>
>
>2) Below are the errors from their compilation attempt:
>
>Thanks!!
>
>
>_______________________________
>Fabian Glaser, PhD
>
>Technion - Israel Institute of Technology
>Haifa 32000, ISRAEL
>
>fglaser.technion.ac.il
>Tel: +972 4 8293701
>Fax: +972 4 8225153
>
>> Begin forwarded message:
>>
>> Date: December 29, 2014 at 2:30:16 PM GMT+2
>> From: HPC <hpc.technion.ac.il>
>> To: Fabian Glaser <fabian.glaser.GMAIL.COM>
>> Cc: HPC-SUPPORT-L.LISTSERV.TECHNION.AC.IL
>> Subject: Re: Fwd: [AMBER] pmemd.MPI fails to run
>>
>> Fabian,
>>
>> I managed to run the configuration properly using the Intel compilers
>>after applying the settings recommended on the Intel site.
>> The serial build completed properly this time, but the parallel part of
>>the build failed.
>> After doing:
>>
>> ./configure -intelmpi intel
>> make install
>>
>> The compilation ended with the error message below.
>> Please send it to the support person.
>>
>>
>>
>> mpiifort -DBINTRAJ -DEMIL -DMPI -c -FR -ip -O3
>>-I/usr/local/amber14_intel/include -I/usr/local/amber14_intel/include
>>-o pub_fft.o pub_fft.F90
>> mpiicpc -shared-intel -o cpptraj ActionFrameCounter.o ActionList.o
>>Action_Angle.o Action_AtomMap.o Action_AtomicCorr.o Action_AtomicFluct.o
>>Action_AutoImage.o Action_Average.o Action_Bounds.o Action_Box.o
>>Action_Center.o Action_CheckStructure.o Action_Closest.o
>>Action_ClusterDihedral.o Action_Contacts.o Action_CreateCrd.o
>>Action_CreateReservoir.o Action_DNAionTracker.o Action_DSSP.o
>>Action_Density.o Action_Diffusion.o Action_Dihedral.o
>>Action_DihedralScan.o Action_Dipole.o Action_DistRmsd.o
>>Action_Distance.o Action_FilterByData.o Action_FixAtomOrder.o
>>Action_Gist.o Action_Grid.o Action_GridFreeEnergy.o Action_Hbond.o
>>Action_Image.o Action_Jcoupling.o Action_LESsplit.o Action_LIE.o
>>Action_MakeStructure.o Action_Mask.o Action_Matrix.o Action_Molsurf.o
>>Action_MultiDihedral.o Action_NAstruct.o Action_NativeContacts.o
>>Action_NMRrst.o Action_OrderParameter.o Action_Outtraj.o
>>Action_PairDist.o Action_Pairwise.o Action_Principal.o
>>Action_Projection.o Action_Pucker.o Action_Radgyr.o Action_Radial.o
>>Action_RandomizeIons.o Action_Rmsd.o Action_Rotate.o Action_Rotdif.o
>>Action_RunningAvg.o Action_STFC_Diffusion.o Action_Scale.o
>>Action_SetVelocity.o Action_Spam.o Action_Strip.o Action_Surf.o
>>Action_SymmetricRmsd.o Action_Temperature.o Action_Translate.o
>>Action_Unwrap.o Action_Vector.o Action_VelocityAutoCorr.o
>>Action_Volmap.o Action_Watershell.o AnalysisList.o Analysis_AmdBias.o
>>Analysis_AutoCorr.o Analysis_Average.o Analysis_Clustering.o
>>Analysis_Corr.o Analysis_CrankShaft.o Analysis_CrdFluct.o
>>Analysis_CrossCorr.o Analysis_Divergence.o Analysis_FFT.o
>>Analysis_Hist.o Analysis_Integrate.o Analysis_IRED.o Analysis_KDE.o
>>Analysis_Lifetime.o Analysis_Matrix.o Analysis_MeltCurve.o
>>Analysis_Modes.o Analysis_MultiHist.o Analysis_Overlap.o
>>Analysis_Regression.o Analysis_RemLog.o Analysis_Rms2d.o
>>Analysis_RmsAvgCorr.o Analysis_RunningAvg.o Analysis_Spline.o
>>Analysis_Statistics.o Analysis_Timecorr.o Analysis_VectorMath.o
>>ArgList.o Array1D.o Atom.o AtomMap.o AtomMask.o AxisType.o Box.o
>>BufferedFrame.o BufferedLine.o ByteRoutines.o CIFfile.o ClusterDist.o
>>ClusterList.o ClusterMatrix.o ClusterNode.o ClusterSieve.o
>>Cluster_DBSCAN.o Cluster_HierAgglo.o Command.o ComplexArray.o Corr.o
>>Cpptraj.o CpptrajFile.o CpptrajState.o CpptrajStdio.o DataFile.o
>>DataFileList.o DataIO.o DataIO_Evecs.o DataIO_Gnuplot.o DataIO_Grace.o
>>DataIO_OpenDx.o DataIO_Mdout.o DataIO_RemLog.o DataIO_Std.o
>>DataIO_Xplor.o DataSet.o DataSetList.o DataSet_1D.o DataSet_2D.o
>>DataSet_3D.o DataSet_Coords.o DataSet_Coords_CRD.o DataSet_Coords_TRJ.o
>>DataSet_GridFlt.o DataSet_MatrixDbl.o DataSet_MatrixFlt.o DataSet_Mesh.o
>>DataSet_Modes.o DataSet_RemLog.o DataSet_Vector.o DataSet_double.o
>>DataSet_float.o DataSet_integer.o DataSet_string.o DihedralSearch.o
>>Dimension.o DistRoutines.o FileIO_Bzip2.o FileIO_Gzip.o FileIO_Mpi.o
>>FileIO_Std.o FileName.o FileTypes.o Frame.o FrameList.o GridAction.o
>>Hungarian.o ImageRoutines.o MapAtom.o MaskToken.o Matrix_3x3.o
>>Mol2File.o NameType.o NetcdfFile.o PDBfile.o ParmFile.o Parm_Amber.o
>>Parm_CharmmPsf.o Parm_CIF.o Parm_Mol2.o Parm_PDB.o Parm_SDF.o
>>ProgressBar.o PubFFT.o Random.o Range.o ReadLine.o ReferenceAction.o
>>ReferenceFrame.o SDFfile.o StringRoutines.o SymmetricRmsdCalc.o Timer.o
>>Topology.o TopologyList.o TorsionRoutines.o Traj_AmberCoord.o
>>Traj_AmberNetcdf.o Traj_AmberRestart.o Traj_AmberRestartNC.o
>>Traj_Binpos.o Traj_CharmmDcd.o Traj_CIF.o Traj_Conflib.o Traj_GmxTrX.o
>>Traj_Mol2File.o Traj_PDBfile.o Traj_SDF.o Traj_SQM.o TrajectoryFile.o
>>Trajin.o TrajinList.o Trajin_Multi.o Trajin_Single.o Trajout.o
>>TrajoutList.o Vec3.o main.o MpiRoutines.o molsurf.o pub_fft.o \
>> -L/usr/local/amber14_intel/lib
>>/usr/local/amber14_intel/lib/libnetcdf.a -lz -lbz2 -larpack
>>-llapack -lblas -lifport -lifcore -lsvml readline/libreadline.a
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_CommGetAttr_fort'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_TypeGetAttr'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Err_create_code'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined
>>reference to `MPIR_Keyval_set_proxy'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_CommSetAttr'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Datatype_direct'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIU_Handle_get_ptr_indirect'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined
>>reference to `MPIR_Errhandler_set_cxx'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_WinSetAttr'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Err_return_comm'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Grequest_set_lang_f77'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_TypeSetAttr'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Datatype_builtin'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_b_use_gettimeofday'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `i_malloc'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Type_free_impl'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_WinGetAttr'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Add_finalize'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Wtick'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Type_contiguous'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Wtime_todouble'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Datatype_mem'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Type_commit'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPID_Datatype_set_contents'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigc4.so: undefined
>>reference to `MPIR_Op_set_cxx'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `i_free'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Process'
>> /usr/local/intel14//impi/4.1.3.048/intel64/lib/libmpigf.so: undefined
>>reference to `MPIR_Err_preOrPostInit'
>> make[3]: *** [cpptraj] Error 1
>> make[3]: Leaving directory
>>`/usr/local/amber14_intel/AmberTools/src/cpptraj/src'
>> make[2]: *** [install_mpi] Error 2
>> make[2]: Leaving directory
>>`/usr/local/amber14_intel/AmberTools/src/cpptraj'
>> make[1]: *** [parallel] Error 2
>> make[1]: Leaving directory `/usr/local/amber14_intel/AmberTools/src'
>> make: *** [install] Error 2
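>>
>> One thing we still plan to check before rebuilding: that the Intel compiler
>>and Intel MPI environments are both sourced in the shell doing the build.
>>Something along these lines (the exact paths are guesses based on the
>>/usr/local/intel14 locations in the errors above):
>>
>> # adjust the paths to the actual installation directories
>> source /usr/local/intel14/composer_xe_2013/bin/compilervars.sh intel64
>> source /usr/local/intel14/impi/4.1.3.048/intel64/bin/mpivars.sh
>> cd /usr/local/amber14_intel
>> make clean
>> ./configure -intelmpi intel
>> make install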
>>
>>
>>
>> HPC Support
>> CIS - Technion
>> On 12/29/2014 02:05 PM, Fabian Glaser wrote:
>>> Here is the answer, below my email. The problem is not the AMBER
>>>software; it is the MPI implementation, or the way it is used on the tamnun
>>>system. See the --hostfile suggestion: somehow we have to tell mpirun
>>>where to place the different processes. It seems the default is to run them
>>>all on the same node, which is probably why you see the same problem
>>>with other programs. See the answer below mine.
>>>
>>> These people are very good technically, and I am sure they know what
>>>they are saying. I would try to follow their instructions; the problem is
>>>not in AMBER.
>>>
>>> Please give this problem some priority; I cannot do
>>>anything without this software.
>>>
>>> Thanks,
>>>
>>> Fabian
>>>
>>>
>>>
>>>
>>>> Begin forwarded message:
>>>>
>>>> Date: December 29, 2014 at 1:53:18 PM GMT+2
>>>> From: Jason Swails <jason.swails.gmail.com>
>>>> To: AMBER Mailing List <amber.ambermd.org>
>>>> Cc: HPC <hpc.technion.ac.il>
>>>> Subject: Re: [AMBER] pmemd.MPI fails to run
>>>> Reply-To: AMBER Mailing List <amber.ambermd.org>
>>>>
>>>> On Mon, Dec 29, 2014 at 5:33 AM, Fabian Glaser
>>>><fabian.glaser.gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Bill,
>>>>>
>>>>> We use PBS queuing, and although all the internal variables of
>>>>>the job look correct (24 CPUs from two different nodes), the same
>>>>>problem occurs: the job runs on only one node with 24 processes
>>>>>(2 processes per CPU), while the second node stays empty, instead
>>>>>of starting 24 processes across two nodes. The output variables
>>>>>printed when the job ends, with their actual values from the PBS
>>>>>output, are below, and the PBS script I use is also pasted. The
>>>>>AMBER output files, by the way, are produced just fine; the
>>>>>problem is that the job is not spread across more than one node.
>>>>>
>>>>> So it seems the PBS queue is working correctly, but something is
>>>>>preventing the job from using two nodes. Do you still think the
>>>>>problem is in the system setup, or do you think we should
>>>>>recompile?
>>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> Fabian
>>>>>
>>>>>
>>>>> • PBS_O_HOST - name of the host upon which qsub command is
>>>>>running
>>>>> • PBS_O_QUEUE - name of the original queue to which the job
>>>>>was
>>>>> submitted
>>>>> • PBS_O_WORKDIR - absolute path of the current working
>>>>>directory
>>>>> of the qsub command
>>>>> • PBS_ENVIRONMENT - set to PBS_BATCH to indicate the job is a
>>>>> batch job, or to PBS_INTERACTIVE to indicate the job is a PBS
>>>>>interactive
>>>>> job
>>>>> • PBS_JOBID - the job identifier assigned to the job by the
>>>>>batch
>>>>> system
>>>>> • PBS_JOBNAME - the job name supplied by the user
>>>>> • PBS_NODEFILE - the name of the file containing the list of
>>>>>nodes
>>>>> assigned to the job
>>>>> • PBS_QUEUE - the name of the queue from which the job is
>>>>>executed
>>>>>
>>>>> PBS output:
>>>>> ===
>>>>> -bash-4.1$ more ETA_1_min3.o1076308
>>>>> /u/fglaser/projects/IsraelVlodavsky/hep1_v2/MD/ETA_1/min
>>>>> tamnun.default.domain
>>>>> all_l_p
>>>>> /u/fglaser/projects/IsraelVlodavsky/hep1_v2/MD/ETA_1/min
>>>>> PBS_BATCH
>>>>> 1076308.tamnun
>>>>> ETA_1_min3
>>>>> all_l_p_exe
>>>>> nodes (24 cpu total):
>>>>> n032.default.domain
>>>>> n034.default.domain
>>>>>
>>>>> PBS file
>>>>> ======
>>>>>
>>>>> #!/bin/sh
>>>>> #
>>>>> # job name (default is the name of pbs script file)
>>>>> #---------------------------------------------------
>>>>> #PBS -N ETA_1_min3
>>>>> # Submit the job to the queue "queue_name"
>>>>> #---------------------------------------------------
>>>>> #PBS -q all_l_p
>>>>> # Send the mail messages (see below) to the specified user address
>>>>> #-----------------------------------------------------------------
>>>>> #PBS -M fglaser.technion.ac.il
>>>>> # send me mail when the job begins
>>>>> #---------------------------------------------------
>>>>> #PBS -mbea
>>>>> # resource limits: number and distribution of parallel processes
>>>>> #------------------------------------------------------------------
>>>>> #PBS -l select=2:ncpus=12:mpiprocs=12
>>>>> #
>>>>> # comment: this select statement means: use M chunks (nodes),
>>>>> # use N (=< 12) CPUs for N mpi tasks on each of M nodes.
>>>>> # "scatter" will use exactly N CPUs from each node, while omitting
>>>>> # "-l place" statement will fill all available CPUs of M nodes
>>>>> #
>>>>> # specifying working directory
>>>>> #------------------------------------------------------
>>>>> echo $PBS_O_WORKDIR
>>>>> echo $PBS_O_HOST
>>>>> echo $PBS_O_QUEUE
>>>>> echo $PBS_O_WORKDIR
>>>>> echo $PBS_ENVIRONMENT
>>>>> echo $PBS_JOBID
>>>>> echo $PBS_JOBNAME
>>>>> echo $PBS_QUEUE
>>>>>
>>>>>
>>>>> cd $PBS_O_WORKDIR
>>>>>
>>>>>
>>>>> # Count the MPI slots (one line per process) listed in the node file
>>>>> NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
>>>>> echo "nodes ($NP cpu total):"
>>>>> sort $PBS_NODEFILE | uniq
>>>>>
>>>>>
>>>>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>>>>> export PATH=/usr/lib64/openmpi/bin:$PATH
>>>>> source /usr/local/amber14/amber.sh
>>>>>
>>>>> #source /usr/local/amber14/setup.sh
>>>>>
>>>>> # running MPI executable with M*N processes
>>>>> #------------------------------------------------------
>>>>>
>>>>> mpirun -np 24 pmemd.MPI -O -i min.in -o min.out -p
>>>>> ../hep1_system_ETA_ETA_1.prmtop -c ../hep1_system_ETA_ETA_1.prmcrd -r
>>>>> min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd
>>>>>
>>>> A number of people have already pointed out what they think is
>>>>happening
>>>> (and I agree with them): you are not giving any instruction here to
>>>>tell
>>>> the MPI implementation WHERE to actually run those 24 threads. In
>>>>some
>>>> cases (depending on how your MPI is installed), this will mean that
>>>>all 24
>>>> threads are run on the same node. If this is happening, you need to
>>>> provide a machinefile to mpirun to tell it exactly where to start all
>>>>of
>>>> the threads. In the case of OpenMPI, this can be done with
>>>>--hostfile; so
>>>> something like this:
>>>>
>>>> mpirun -np 24 --hostfile $PBS_NODEFILE pmemd.MPI -O ...
>>>>
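>>>> As a quick sanity check (using only what is already in your job script),
>>>>something like the following just before the mpirun line should report 12
>>>>slots on each of the two nodes if the select=2:ncpus=12:mpiprocs=12
>>>>request is being honored:
>>>>
>>>> echo "MPI slots per node:"
>>>> sort $PBS_NODEFILE | uniq -c
>>>>
>>>> If only one node is listed, the problem is upstream of mpirun.
>>>>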
>>>> That said, the most common PBS implementation (Torque) provides an
>>>>API so
>>>> that applications can be made aware of the scheduler. The various MPI
>>>> implementations (OpenMPI, MPICH, etc.) can all be built with Torque
>>>> integration, which will make it *much* easier to use within the PBS
>>>> environment. For example, on one of the HPC systems I've used in the
>>>>past,
>>>> I was able to use the command:
>>>>
>>>> mpiexec pmemd.MPI -O ...
>>>>
>>>> with no arguments at all to mpiexec/mpirun -- in this case, mpiexec
>>>>was
>>>> able to figure out how many threads to run and where to run them
>>>>because it
>>>> was integrated directly with the scheduler.
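>>>>
>>>> (Whether your Open MPI build has that Torque integration is easy to
>>>>check; this assumes ompi_info comes from the same Open MPI you launch
>>>>with:)
>>>>
>>>> ompi_info | grep -i tm
>>>>
>>>> If no "tm" components are listed, that build has no Torque/PBS support,
>>>>and you will need the --hostfile route or a rebuild of Open MPI with
>>>>--with-tm.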
>>>>
>>>> HTH,
>>>> Jason
>>>>
>>>> --
>>>> Jason M. Swails
>>>> BioMaPS,
>>>> Rutgers University
>>>> Postdoctoral Researcher
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Dec 29 2014 - 08:30:02 PST