Re: parallel jobs die with no error message from sander

From: jim caldwell <caldwell_at_heimdal.compchem.ucsf.edu>
Date: Mon 29 Jul 2002 13:08:50 -0700 (PDT)

That looks like a communication problem to me. Are your machines
on a busy network? Can you isolate the compute nodes behind a
router/switch?
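
The errno values in the quoted traces point the same way: on Linux,
104 is ECONNRESET (connection reset by peer) and 110 is ETIMEDOUT
(connection timed out). Below is a minimal Python sketch for pulling
those codes out of a saved mpirun error log. The function name and the
small lookup table are illustrative, not part of Amber or MPICH, and the
table assumes Linux errno numbering (the numbers differ on other
platforms).

```python
import re

# Linux errno values (assumption: the clusters in the logs run Linux;
# other platforms number these errors differently).
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches lines such as "p0_25797: p4_error: net_recv read, errno = : 104".
# Lines like "p4_error: interrupt SIGINT: 2" carry no errno and are skipped.
P4_ERRNO = re.compile(
    r"(?P<proc>\S+):\s+p4_error:\s+(?P<what>.*?),\s*errno = :\s*(?P<errno>\d+)"
)


def decode_p4_errors(log_text):
    """Return (process, operation, errno, name, description) tuples."""
    results = []
    for line in log_text.splitlines():
        m = P4_ERRNO.search(line)
        if m is None:
            continue
        code = int(m.group("errno"))
        name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        results.append((m.group("proc"), m.group("what"), code, name, desc))
    return results
```

Run over the two traces quoted below, this would report ETIMEDOUT and
ECONNRESET on net_recv reads, i.e. the socket to a remote process went
away or stopped answering, rather than sander itself raising an error.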

jim

On Mon, 29 Jul 2002, Joffre Heredia wrote:

>
> Same problem here using amber7/sander compiled with the Portland Group compiler.
>
> -------------------------------------------------------------
> Joffre Heredia Rodrigo Tel: (34)-93-5813812
> Laboratory of Computational Medicine Fax: (34)-93-5812344
> Biostatistic Dept.
> UAB School of Medicine. Bellaterra Joffre.Heredia_at_uab.es
> 08193-Barcelona (SPAIN)
> -------------------------------------------------------------
>
> On Mon, 29 Jul 2002 lamon_at_lav.Boehringer-Ingelheim.com wrote:
>
> >
> > > I am having the same problem with Amber7/sander as Mark Hsieh (see below)
> > > in which my jobs stop at random points for no obvious reason. My error
> > > output for a job running on four processors looks like the following:
> > >
> > > net_recv failed for fd = 8
> > > p0_25797: p4_error: net_recv read, errno = : 104
> > > bm_list_25798: p4_error: interrupt SIGINT: 2
> > > rm_l_3_31636: p4_error: interrupt SIGINT: 2
> > > p1_19028: p4_error: interrupt SIGINT: 2
> > > Broken pipe
> > > rm_l_2_7717: p4_error: interrupt SIGINT: 2
> > > p3_31635: p4_error: interrupt SIGINT: 2
> > > Broken pipe
> > > rm_l_1_19029: p4_error: interrupt SIGINT: 2
> > > Broken pipe
> > > /software/mpich-1.2.1/bin/mpirun: line 1: 25797 Broken pipe
> > > /software/amber7/exe_lnx_pll/sander "-i" "md.in" "-o" "md2.out" "-p"
> > > "md.parm" "-c" "md1.rst" "-r" "md2.rst" "-x" "md2.crd" "-O" -p4pg PI25665
> > > -p4wd .
> > >
> > >
> > > Here are my observations:
> > >
> > > 1. My jobs end unexpectedly with no error message from sander.
> > > 2. The problem is intermittent. If I run it one time, it might die after
> > > 2 ps and another time it will die after 50 ps.
> > > I have some systems in which it does not happen at all.
> > > 3. Output from sander stops several minutes before the job exits.
> > > 4. There are no system error messages indicating a hardware failure.
> > > 5. It only happens with jobs running on more than one processor.
> > > > 6. I have tried it on two different Linux clusters (compiled with g77)
> > > running two different versions of mpich (1.2.1 and 1.2.4).
> > > 7. A job run with the same input parameters and restart file but using
> > > Amber6/sander will run with no problems.
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > > Lynn
> > >
> > >
> > >
> > > Dr. Lynn Amon
> > > Research Scientist
> > > Boehringer-Ingelheim (Canada) Ltd.
> > > 2100 Cunard Street
> > > Laval (Quebec) Canada H7S 2G5
> > > (450) 682-4640
> > > lamon_at_lav.boehringer-ingelheim.com
> > >
> > >
> > >
> > >
> > > > From: "Mark Hsieh" <hsieh_at_abmaxis.com>
> > > Subject: mpirun/sander problem
> > > Date: Mon, 6 May 2002 13:00:32 -0700
> > > Message-ID: <MPEKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>
> > >
> > > Hi,
> > > For some reason, my mpirun/sander molecular dynamics simulations are
> > > stopping at random time points with the following errors:
> > > p0_4155: (11784.142298) net_recv failed for fd = 6
> > > p0_4155: p4_error: net_recv read, errno = : 110
> > > p3_2900: (11783.828019) net_recv failed for fd = 6
> > > /disk1/app/mpich-1.2.3/bin/mpirun: line 1: 4155 Broken pipe
> > > /disk1/app/amber7/exe/sander "-O" "-i" "md7.in" "-o" "md7.out" "-p"
> > > "prmtop"
> > > "-c" "min7.rst" "-r" "md7.rst" "-x" "md7.mdcrd" "-ref" "min7.rst" "-inf"
> > > "md7.mdinfo" -p4pg pgfile4 -p4wd /disk1/hsieh.tmp/amber.runs/
> > > P4 procgroup file is pgfile4.
> > > p3_2900: p4_error: net_recv read, errno = : 104
> > > >
> > > > pgfile4 calls up two dual PIII workstations:
> > > tiger 0 /disk1/app/amber7/exe/sander
> > > tiger 1 /disk1/app/amber7/exe/sander
> > > cow2 1 /disk1/app/amber7/exe/sander
> > > cow2 1 /disk1/app/amber7/exe/sander
> > > > Three identical runs produced mdinfo files that indicated 4.1, 19.1 and 7.1
> > > > ps as their last update.
> > > Thank you,
> > > Mark
> > >
> > >
> >
>
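
For reference, the procgroup file quoted in Mark's message can be
checked against the intended layout. As I understand the ch_p4
procgroup format, the first line names the local host and the number of
processes to start there in addition to the master, and each later line
gives a host, a process count, and the program path. The Python sketch
below counts processes per host under that reading; the function name
is made up for illustration.

```python
def count_p4_processes(procgroup_text):
    """Count processes per host in a ch_p4 procgroup file.

    Assumed format: "<host> <count> <program>" per line, where the
    first line's count excludes the master process started by mpirun.
    """
    counts = {}
    rows = [ln.split() for ln in procgroup_text.splitlines() if ln.strip()]
    for i, fields in enumerate(rows):
        host, n = fields[0], int(fields[1])
        # The master process runs on the first host and is not
        # included in that line's count, so add it back here.
        master = 1 if i == 0 else 0
        counts[host] = counts.get(host, 0) + n + master
    return counts


pgfile4 = """\
tiger 0 /disk1/app/amber7/exe/sander
tiger 1 /disk1/app/amber7/exe/sander
cow2 1 /disk1/app/amber7/exe/sander
cow2 1 /disk1/app/amber7/exe/sander
"""
print(count_p4_processes(pgfile4))
```

Under that assumption the file places two processes on each of the two
dual-CPU machines, so every step of the run depends on traffic between
tiger and cow2.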


----------------------------------------------------------------------------
James W. Caldwell (voice) 415-476-8603
Department of Pharmaceutical Chemistry (fax) 415-502-1411
Mail Stop 0446 (email) caldwell_at_heimdal.ucsf.edu
513 Parnassus Avenue
University of California
San Francisco, CA 94143-0446
----------------------------------------------------------------------------
Received on Mon Jul 29 2002 - 13:08:50 PDT