Re: [AMBER] mpi with openmpi - SGE error

From: Yan Gao <yan.gao.2001.gmail.com>
Date: Tue, 27 Jul 2010 19:47:05 -0700

Hi,

I did several MPI tests. The tests with 2/4/8 nodes succeeded, and those
with 16/32 nodes failed with new errors.
I outline my script again here:
*************************************************************************
#!/bin/csh -f
#$ -cwd
#$ -N sim.MD
#$ -S /bin/tcsh
#$ -l h_rt=120:00:00
#$ -e SGE.err
#$ -o SGE.out
#$ -pe mpich 16

set INTELF=/nas/y1gao/soft/intel-11.1.072
set INTELC=/nas/y1gao/soft/intel-c-11.1.072
set MPI_HOME=/home/y1gao/soft/openmpi-1.4.2
set GROHOME=/home/y1gao/soft/gromacs-4.0.5
set GSL=/home/y1gao/soft/gsl-1.9
set AMBERHOME=/nas/y1gao/soft/amber10
set AMBERTOOL=/nas/y1gao/soft/amber11/AmberTools

setenv PATH $INTELF/bin:$INTELC/bin:$GROHOME/bin:$MPI_HOME/bin:$AMBERHOME/exe:$AMBERTOOL/exe:$PATH
setenv LD_LIBRARY_PATH $INTELF/lib/ia32:$INTELC/lib/ia32:$INTELF/idb/lib/ia32:$INTELC/idb/lib/ia32:$GSL/lib:$MPI_HOME/lib:$LD_LIBRARY_PATH
# the $MPI_HOME/lib path is needed.

## Checking setenv, mpirun and sander.MPI
echo "######################################################################################################"
which mpirun
ldd $MPI_HOME/bin/mpirun
which sander.MPI
ldd $AMBERHOME/bin/sander.MPI
echo "######################################################################################################"

$MPI_HOME/bin/mpirun -np 16 $AMBERHOME/bin/sander.MPI -O -i test.in -o test.out -c test.rst -p test.prmtop -r test1.rst
*************************************************************************
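For reference, here is a variant of the launch line that follows up on the "unable to find the needed shared libraries" hint in the error below, using Open MPI's --prefix and -x options (a sketch only, untested here; whether it fixes this particular failure is an assumption). --prefix points the remote orted daemons at this Open MPI installation, and -x forwards LD_LIBRARY_PATH so the Intel runtime libraries are also visible on the compute nodes:

# Hedged sketch of an alternative launch line (same job, extra Open MPI options)
$MPI_HOME/bin/mpirun --prefix $MPI_HOME \
    -x LD_LIBRARY_PATH \
    -np 16 \
    $AMBERHOME/bin/sander.MPI -O -i test.in -o test.out \
    -c test.rst -p test.prmtop -r test1.rst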
Then I got the output:
*******************************SGE.err******************************************
error: commlib error: can't connect to service (No route to host)
error: executing task of job 426907 failed: failed sending task to execd.compute-0-22.local: can't find connection
--------------------------------------------------------------------------
A daemon (pid 14019) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[compute-2-13.local:15452] [[40282,0],4] routed:binomial: Connection to lifeline [[40282,0],0] lost
[compute-0-16.local:23274] [[40282,0],3] routed:binomial: Connection to lifeline [[40282,0],0] lost
[compute-2-15.local:14638] [[40282,0],7] routed:binomial: Connection to lifeline [[40282,0],0] lost
[compute-1-13.local:22080] [[40282,0],5] routed:binomial: Connection to lifeline [[40282,0],0] lost
[compute-0-27.local:21011] [[40282,0],6] routed:binomial: Connection to lifeline [[40282,0],0] lost
[compute-3-0.local:20138] [[40282,0],2] routed:binomial: Connection to lifeline [[40282,0],0] lost

*************************************************************************
and
********************************SGE.out*****************************************
-catch_rsh
/opt/gridengine/default/spool/compute-1-14/active_jobs/426907.1/pe_hostfile
compute-1-14
compute-1-14
compute-0-22
compute-0-22
compute-3-0
compute-3-0
compute-0-16
compute-0-16
compute-2-13
compute-2-13
compute-1-13
compute-1-13
compute-0-27
compute-0-27
compute-2-15
compute-2-15
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
######################################################################################################
/home/y1gao/soft/openmpi-1.4.2/bin/mpirun
    libopen-rte.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libopen-rte.so.0 (0x40001000)
    libopen-pal.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libopen-pal.so.0 (0x40078000)
    libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00395000)
    libdl.so.2 => /lib/libdl.so.2 (0x400d7000)
    libnsl.so.1 => /lib/libnsl.so.1 (0x00412000)
    libutil.so.1 => /lib/libutil.so.1 (0x003dc000)
    libm.so.6 => /lib/tls/libm.so.6 (0x0039b000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x003d2000)
    libpthread.so.0 => /lib/tls/libpthread.so.0 (0x004b3000)
    libc.so.6 => /lib/tls/libc.so.6 (0x00268000)
    libimf.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libimf.so (0x400dc000)
    libsvml.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libsvml.so (0x40341000)
    libintlc.so.5 => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libintlc.so.5 (0x4046c000)
    /lib/ld-linux.so.2 (0x0024a000)
/nas/y1gao/soft/amber10/exe/sander.MPI
    libsvml.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libsvml.so (0x40001000)
    libmpi_f90.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi_f90.so.0 (0x4012b000)
    libmpi_f77.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi_f77.so.0 (0x4012e000)
    libmpi.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi.so.0 (0x40154000)
    libopen-rte.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libopen-rte.so.0 (0x40305000)
    libopen-pal.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libopen-pal.so.0 (0x4037d000)
    libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00395000)
    libdl.so.2 => /lib/libdl.so.2 (0x403dc000)
    libnsl.so.1 => /lib/libnsl.so.1 (0x00412000)
    libutil.so.1 => /lib/libutil.so.1 (0x003dc000)
    libm.so.6 => /lib/tls/libm.so.6 (0x0039b000)
    libpthread.so.0 => /lib/tls/libpthread.so.0 (0x004b3000)
    libc.so.6 => /lib/tls/libc.so.6 (0x00268000)
    libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x003d2000)
    libifport.so.5 => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libifport.so.5 (0x403e1000)
    libifcoremt.so.5 => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libifcoremt.so.5 (0x40401000)
    libimf.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libimf.so (0x40511000)
    libintlc.so.5 => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libintlc.so.5 (0x40776000)
    /lib/ld-linux.so.2 (0x0024a000)
######################################################################################################

*************************************************************************
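Since the first error is a commlib "No route to host" to the execd on compute-0-22 rather than a missing library, a quick per-node reachability check from inside the job may also be useful. Below is a rough sketch of such a check, assuming SGE exports $PE_HOSTFILE to the job and that the hostname is the first column of that file:

#!/bin/csh -f
#$ -cwd
#$ -pe mpich 16
# Probe each node SGE allocated to this job over ssh.
foreach host (`awk '{print $1}' $PE_HOSTFILE | sort -u`)
    # BatchMode avoids hanging on a password prompt if keys are not set up.
    ssh -o BatchMode=yes -o ConnectTimeout=5 $host hostname || echo "$host unreachable"
end

If any node shows up as unreachable here, the problem lies with the cluster network or the ssh setup rather than with Amber or Open MPI.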

Thanks!

Regards,
Yan

On Tue, Jul 27, 2010 at 12:15 PM, Yan Gao <yan.gao.2001.gmail.com> wrote:

> Dear Jason,
>
> Thank you for your comments and help!
> I followed your comments and corrected the small discrepancies.
>
> The passphrase method still did not work, so I turned to the public key,
> and it seemed to work.
>
> Thanks!
> Yan
>
>
> On Tue, Jul 27, 2010 at 10:57 AM, Jason Swails <jason.swails.gmail.com> wrote:
>
>> Hello,
>>
>> I'm not sure how this would affect your results (if at all), but AMBERTOOL
>> should be set to /nas/y1gao/soft/amber11/AmberTools/ (unless of course you
>> changed it, but I think you should avoid that).
>>
>> On Mon, Jul 26, 2010 at 9:45 PM, Yan Gao <yan.gao.2001.gmail.com> wrote:
>>
>> > Thanks Bill. I include my input command with this message.
>> >
>> > Hi there,
>> >
>> > I tried to run amber with openmpi on a unix system. I used the input below
>> > with the command: qsub
>> >
>> >
>> *****************************************************************************************************************
>> > #!/bin/csh -f
>> > #$ -cwd
>> > #$ -N sim.MD
>> > #$ -S /bin/tcsh
>> > #$ -l h_rt=120:00:00
>> > #$ -e SGE.err
>> > #$ -o SGE.out
>> > #$ -pe mpich 4
>> >
>> > set INTELF=/nas/y1gao/soft/intel-11.1.072
>> > set INTELC=/nas/y1gao/soft/intel-c-11.1.072
>> > set MPI_HOME=/home/y1gao/soft/openmpi-1.4.2
>> > set GROHOME=/home/y1gao/soft/gromacs-4.0.5
>> > set GSL=/home/y1gao/soft/gsl-1.9
>> > set AMBERHOME=/nas/y1gao/soft/amber10
>> > set AMBERTOOL=/nas/y1gao/soft/amber11/AmberTool
>> >
>> > set SANDEREXEC=$AMBERHOME/bin/sander
>> > set LEAPEXEC=$AMBERTOOL/bin/tleap
>> >
>> > setenv PATH
>> >
>> >
>> $INTELF/bin:$INTELC/bin:$GROHOME/bin:$MPI_HOME/bin:$AMBERHOME/exe:$AMBERTOOL/exe:$PATH
>> > setenv LD_LIBRARY_PATH
>> >
>> >
>> $INTELF/lib/ia32:$INTELC/lib/ia32:$INTELF/idb/lib/ia32:$INTELC/idb/lib/ia32:$MPI_HOME/lib:$GSL/lib:$AMBERTOOL/lib:$LD_LIBRARY_PATH
>> >
>>
>> These environment variables should be set just fine if you source
>> /nas/y1gao/soft/intel-11.1.072/bin/ifortvars.csh and the corresponding
>> iccvars.csh. Also, you should not have to load $MPI_HOME/lib into your
>> LD_LIBRARY_PATH. Same with $AMBERTOOL/lib (that's not needed in
>> LD_LIBRARY_PATH).
>>
>>
>> >
>> > ## Checking setenvm, mpi and sander
>> > echo
>> >
>> >
>> "######################################################################################################"
>> > which mpirun
>> > ldd /home/y1gao/soft/openmpi-1.4.2/bin/mpirun
>> >
>>
>> Why not use $MPI_HOME/bin/mpirun, just to be consistent...
>>
>>
>> > which sander.MPI
>> > ldd /nas/y1gao/soft/amber10/bin/sander.MPI
>> >
>>
>> same with $AMBERHOME/bin.
>>
>>
>> > echo
>> >
>> >
>> "######################################################################################################"
>> >
>> > mpirun -np 4 sander.MPI -O -i test.in -o test.out -c test.rst -p
>> > test.prmtop
>> > -r test.rst
>> >
>>
>> You should specify mpirun and sander.MPI the same way you did above, with
>> $MPI_HOME and $AMBERHOME, I think. Consistency helps make sure you're
>> checking everything properly.
>>
>>
>> >
>> >
>> >
>> >
>> ***************************************************************************************************************************************
>> >
>> >
>> >
>> > I got below errors when I did a trial:
>> >
>> >
>> >
>> >
>> *******************************************SGE.err**************************************************************************
>> > Permission denied, please try again.
>> > Permission denied, please try again.
>> > Permission denied (publickey,gssapi-with-mic,password).
>> >
>>
>> This is not an amber issue, this has to do with the fact that you don't
>> have
>> permissions with your ssh key on the nodes you're trying to run on. If
>> you
>> have a system admin you can talk to about this, they would probably help
>> more.
>>
>> Good luck!
>> Jason
>>
>>
>> >
>> --------------------------------------------------------------------------
>> > A daemon (pid 17525) died unexpectedly with status 129 while attempting
>> > to launch so we are aborting.
>> >
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> >
>> >
>> *****************************************************SGE.out***********************************************************
>> > -catch_rsh
>> >
>> /opt/gridengine/default/spool/compute-0-19/active_jobs/426880.1/pe_hostfile
>> > compute-0-19
>> > compute-0-19
>> > compute-0-18
>> > compute-0-18
>> > Warning: no access to tty (Bad file descriptor).
>> > Thus no job control in this shell.
>> >
>> >
>> ######################################################################################################
>> > /home/y1gao/soft/openmpi-1.4.2/bin/mpirun
>> > libopen-rte.so.0 =>
>> /home/y1gao/soft/openmpi-1.4.2/lib/libopen-rte.so.0
>> > (0x40001000)
>> > libopen-pal.so.0 =>
>> /home/y1gao/soft/openmpi-1.4.2/lib/libopen-pal.so.0
>> > (0x40078000)
>> > libnuma.so.1 => /usr/lib/libnuma.so.1 (0x0077c000)
>> > libdl.so.2 => /lib/libdl.so.2 (0x400d7000)
>> > libnsl.so.1 => /lib/libnsl.so.1 (0x0080e000)
>> > libutil.so.1 => /lib/libutil.so.1 (0x007c3000)
>> > libm.so.6 => /lib/tls/libm.so.6 (0x00782000)
>> > libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x007b9000)
>> > libpthread.so.0 => /lib/tls/libpthread.so.0 (0x0089a000)
>> > libc.so.6 => /lib/tls/libc.so.6 (0x0064f000)
>> > libimf.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libimf.so
>> > (0x400dc000)
>> > libsvml.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libsvml.so
>> > (0x40341000)
>> > libintlc.so.5 =>
>> /nas/y1gao/soft/intel-11.1.072/lib/ia32/libintlc.so.5
>> > (0x4046c000)
>> > /lib/ld-linux.so.2 (0x00631000)
>> > /nas/y1gao/soft/amber10/exe/sander.MPI
>> > libsvml.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libsvml.so
>> > (0x40001000)
>> > libmpi_f90.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi_f90.so.0
>> > (0x4012b000)
>> > libmpi_f77.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi_f77.so.0
>> > (0x4012e000)
>> > libmpi.so.0 => /home/y1gao/soft/openmpi-1.4.2/lib/libmpi.so.0
>> > (0x40154000)
>> > libopen-rte.so.0 =>
>> /home/y1gao/soft/openmpi-1.4.2/lib/libopen-rte.so.0
>> > (0x40305000)
>> > libopen-pal.so.0 =>
>> /home/y1gao/soft/openmpi-1.4.2/lib/libopen-pal.so.0
>> > (0x4037d000)
>> > libnuma.so.1 => /usr/lib/libnuma.so.1 (0x0077c000)
>> > libdl.so.2 => /lib/libdl.so.2 (0x403dc000)
>> > libnsl.so.1 => /lib/libnsl.so.1 (0x0080e000)
>> > libutil.so.1 => /lib/libutil.so.1 (0x007c3000)
>> > libm.so.6 => /lib/tls/libm.so.6 (0x00782000)
>> > libpthread.so.0 => /lib/tls/libpthread.so.0 (0x0089a000)
>> > libc.so.6 => /lib/tls/libc.so.6 (0x0064f000)
>> > libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x007b9000)
>> > libifport.so.5 =>
>> /nas/y1gao/soft/intel-11.1.072/lib/ia32/libifport.so.5
>> > (0x403e1000)
>> > libifcoremt.so.5 =>
>> > /nas/y1gao/soft/intel-11.1.072/lib/ia32/libifcoremt.so.5 (0x40401000)
>> > libimf.so => /nas/y1gao/soft/intel-11.1.072/lib/ia32/libimf.so
>> > (0x40511000)
>> > libintlc.so.5 =>
>> /nas/y1gao/soft/intel-11.1.072/lib/ia32/libintlc.so.5
>> > (0x40776000)
>> > /lib/ld-linux.so.2 (0x00631000)
>> >
>> >
>> ######################################################################################################
>> >
>> >
>> >
>> >
>> *****************************************************************************************************************
>> >
>> > I then googled "Permission denied (publickey,gssapi-with-mic,password)",
>> > and set up / copied the passphrase, so I can log onto a node automatically
>> > without typing the password/passphrase manually.
>> > Then I tried again with mpi and got the same output. I am kind of stuck
>> > here; could anyone help me? Thanks!
>> >
>> > Regards,
>> > --
>> > Yan Gao
>> > Jacobs School of Engineering
>> > University of California, San Diego
>> > Tel: 858-952-2308
>> > Email: Yan.Gao.2001.gmail.com
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>> >
>>
>>
>>
>> --
>> Jason M. Swails
>> Quantum Theory Project,
>> University of Florida
>> Ph.D. Graduate Student
>> 352-392-4032
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
>
> --
> Yan Gao
> Jacobs School of Engineering
> University of California, San Diego
> Tel: 858-952-2308
> Email: Yan.Gao.2001.gmail.com
>



-- 
Yan Gao
Jacobs School of Engineering
University of California, San Diego
Tel: 858-952-2308
Email: Yan.Gao.2001.gmail.com
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 27 2010 - 20:00:03 PDT