Re: parallel jobs die with no error message from sander

From: Joffre Heredia <joffre_at_yogi.uab.es>
Date: Mon 29 Jul 2002 21:06:33 -0700

 Same problem here with amber7/sander compiled with the Portland Group compilers.

-------------------------------------------------------------
Joffre Heredia Rodrigo                  Tel: (34)-93-5813812
Laboratory of Computational Medicine    Fax: (34)-93-5812344
Biostatistics Dept.
UAB School of Medicine, Bellaterra      Joffre.Heredia_at_uab.es
08193-Barcelona (SPAIN)
-------------------------------------------------------------

On Mon, 29 Jul 2002 lamon_at_lav.Boehringer-Ingelheim.com wrote:

>
> > I am having the same problem with Amber7/sander as Mark Hsieh (see
> > below): my jobs stop at random points for no obvious reason. The error
> > output for a job running on four processors looks like the following:
> >
> > net_recv failed for fd = 8
> > p0_25797: p4_error: net_recv read, errno = : 104
> > bm_list_25798: p4_error: interrupt SIGINT: 2
> > rm_l_3_31636: p4_error: interrupt SIGINT: 2
> > p1_19028: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > rm_l_2_7717: p4_error: interrupt SIGINT: 2
> > p3_31635: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > rm_l_1_19029: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > /software/mpich-1.2.1/bin/mpirun: line 1: 25797 Broken pipe
> > /software/amber7/exe_lnx_pll/sander "-i" "md.in" "-o" "md2.out" "-p"
> > "md.parm" "-c" "md1.rst" "-r" "md2.rst" "-x" "md2.crd" "-O" -p4pg PI25665
> > -p4wd .
> >
> >
> > Here are my observations:
> >
> > 1. My jobs end unexpectedly with no error message from sander.
> > 2. The problem is intermittent. The same job might die after 2 ps on
> > one run and after 50 ps on the next, and for some systems it does not
> > happen at all.
> > 3. Output from sander stops several minutes before the job exits.
> > 4. There are no system error messages indicating a hardware failure.
> > 5. It only happens with jobs running on more than one processor.
> > 6. I have tried it on two different Linux clusters (compiled with g77)
> > running two different versions of mpich (1.2.1 and 1.2.4).
> > 7. A job with the same input parameters and restart file runs with no
> > problems under Amber6/sander.
> >
> > Any suggestions?
> >
> > Thanks,
> > Lynn
> >
> >
> >
> > Dr. Lynn Amon
> > Research Scientist
> > Boehringer-Ingelheim (Canada) Ltd.
> > 2100 Cunard Street
> > Laval (Quebec) Canada H7S 2G5
> > (450) 682-4640
> > lamon_at_lav.boehringer-ingelheim.com
> >
> >
> >
> >
> > From: "Mark Hsieh" <hsieh_at_abmaxis.com
> > <mailto:hsieh_at_abmaxis.com?subject=Re:%20mpirun/sander%20problem&replyto=MP
> > EKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>>
> > Subject: mpirun/sander problem
> > Date: Mon, 6 May 2002 13:00:32 -0700
> > Message-ID: <MPEKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>
> >
> > Hi,
> > For some reason, my mpirun/sander molecular dynamics simulations are
> > stopping at random time points with the following errors:
> > p0_4155: (11784.142298) net_recv failed for fd = 6
> > p0_4155: p4_error: net_recv read, errno = : 110
> > p3_2900: (11783.828019) net_recv failed for fd = 6
> > /disk1/app/mpich-1.2.3/bin/mpirun: line 1: 4155 Broken pipe
> > /disk1/app/amber7/exe/sander "-O" "-i" "md7.in" "-o" "md7.out" "-p"
> > "prmtop"
> > "-c" "min7.rst" "-r" "md7.rst" "-x" "md7.mdcrd" "-ref" "min7.rst" "-inf"
> > "md7.mdinfo" -p4pg pgfile4 -p4wd /disk1/hsieh.tmp/amber.runs/
> > P4 procgroup file is pgfile4.
> > p3_2900: p4_error: net_recv read, errno = : 104
> > pgfile4 calls up two dual PIII workstations:
> > tiger 0 /disk1/app/amber7/exe/sander
> > tiger 1 /disk1/app/amber7/exe/sander
> > cow2 1 /disk1/app/amber7/exe/sander
> > cow2 1 /disk1/app/amber7/exe/sander
> > Three identical runs produced mdinfo files that indicated 4.1, 19.1 and
> > 7.1 ps as their last update.
> > Thank you,
> > Mark
> >
> >
>
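A note on the errno values quoted above: on Linux, errno 104 is ECONNRESET
("Connection reset by peer") and errno 110 is ETIMEDOUT ("Connection timed
out"). Both come from the net_recv calls in MPICH's p4 device, not from
sander itself, which is why the job dies with no sander error message: the
MPI layer loses a TCP connection to one of the ranks and tears the whole
job down. One way to isolate the problem is to exercise the same MPICH ring
without sander, for example with the small cpi example included with the
MPICH 1.2.x sources (paths and the machine file below are illustrative):

   # Sanity-check the MPICH ring without sander; cpi is the
   # pi-calculation example shipped in the MPICH source tree.
   cd /software/mpich-1.2.1/examples/basic
   make cpi
   /software/mpich-1.2.1/bin/mpirun -np 4 -machinefile hosts ./cpi

If cpi also dies intermittently on a long run, the problem is in the
MPICH/network layer rather than in amber7/sander.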
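For reference, the p4 procgroup format (if I recall the MPICH p4
documentation correctly) is one line per host,
<hostname> <#procs> <full_path_to_executable>, where the first line names
the local host and its count excludes the master process that mpirun starts
itself. Listing cow2 on two lines with a count of 1 each, as in the quoted
pgfile4, should behave the same as one line with a count of 2; a more
conventional file for two dual-processor nodes (hostnames and paths taken
from the quoted file) would be:

   tiger 1 /disk1/app/amber7/exe/sander
   cow2 2 /disk1/app/amber7/exe/sander

Either form yields two sander processes per node, four in total, matching
a four-processor run.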
Received on Mon Jul 29 2002 - 21:06:33 PDT