RE: [AMBER] Re: mpirun noticed that process rank 1 ... on signal 1 (Hangup).

From: Ross Walker <ross.rosswalker.co.uk>
Date: Tue, 14 Jul 2009 13:33:49 +0100

Hi Naser,

This is most likely a problem with your hardware. Does this happen only on
this simulation or with different simulations as well?

Are you running on a single node here or across multiple nodes? In the latter
case you should check your cabling, run the MPI bandwidth and latency tests,
and run some MPI and hardware stress tests to make sure the cables and switch
are behaving themselves. If you are on a single node I would run some stress
/ memory tests on it to make sure it is good.

Also check that you haven't got some kind of shell timeout set for your
machine; overzealous admins often set things like 1 hour limits on
interactive commands. Are you running this in a queueing system? If so,
are you requesting enough wallclock time?
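
A quick way to look for such limits is sketched below. It assumes a bash-like
shell on Linux, and it is only a guess at what your admins might have set:
TMOUT (bash's idle timeout for interactive shells) and 'ulimit -t' (the
CPU-time limit for processes started from the shell) are two common
mechanisms, not necessarily the ones in use on your machine. The variable
names are just for illustration.

```shell
# TMOUT: idle timeout in seconds for an interactive bash session.
# "unset" here means no idle logout is configured through this variable.
idle_timeout="${TMOUT:-unset}"
echo "TMOUT=$idle_timeout"

# CPU-time limit (seconds) applied to processes started from this shell.
cpu_limit=$(ulimit -t)
echo "cpu_time_limit=$cpu_limit"
```

If either of these reports a finite value, compare it with how long the job
ran before it died.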

Is your stack set to unlimited in your shell?
Does your machine have enough memory?
Are other things running on it at the same time?
Are you running the command with 'nohup' so it isn't killed when you close
the terminal?
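
The stack and nohup questions above can be checked along these lines. This is
a minimal sketch, assuming a POSIX-like shell; the 'sleep 1; echo finished'
command is a hypothetical stand-in for your real mpirun command line, and
job.log is just an example file name.

```shell
# Stack size limit for this shell; "unlimited" is the value asked about above.
stack_limit=$(ulimit -s)
echo "stack_limit=$stack_limit"

# Signal 1 is SIGHUP, which is sent when the controlling terminal goes away.
sig_name=$(kill -l 1)
echo "signal 1 = $sig_name"

# Start the job under nohup so that it ignores SIGHUP. Redirecting stdin,
# stdout and stderr keeps the job fully detached from the terminal.
# The quoted command is a placeholder for the real mpirun invocation.
nohup sh -c 'sleep 1; echo finished' > job.log 2>&1 < /dev/null &
job_pid=$!

# Wait for the placeholder job and show what it wrote.
wait "$job_pid"
cat job.log
```

If the stack limit is finite, 'ulimit -s unlimited' raises it for the current
shell, provided the hard limit allows it.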

Is some 'nice' person just trying to mess with you by logging in as root and
randomly killing your jobs?

Short answer: this is more likely a hardware / user environment issue than
an AMBER issue, especially if it happens on multiple different simulations.

All the best
Ross

> -----Original Message-----
> From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On
> Behalf Of Naser Alijabbari
> Sent: Tuesday, July 14, 2009 3:49 AM
> To: AMBER Mailing List
> Subject: [AMBER] Re: mpirun noticed that process rank 1 ... on signal 1
> (Hangup).
>
> Sorry, the message was cut off. However, when I use the same configuration
> for a new computer (Systemax model 981091, Intel core quad), I get the
> following error at non-reproducible intervals:
> NSTEP = 32000   TIME(PS) = 186.201   TEMP(K) = 289.61   PRESS = 0.0
> Etot   = -28480.7941   EKtot   = 7231.1811   EPtot     = -35711.9752
> BOND   =    286.8982   ANGLE   =  774.4197   DIHED     =   1115.8723
> 1-4 NB =    373.9270   1-4 EEL = 5943.6287   VDWAALS   =   3888.4780
> EELEC  = -48095.1992   EHBOND  =    0.0000   RESTRAINT =      0.0000
> Ewald error estimate: 0.3488E-03
> ------------------------------------------------------------------------------
>
> NSTEP = 33000   TIME(PS) = 187.201   TEMP(K) = 289.75   PRESS = 0.0
> Etot   = -28482.0763   EKtot   = 7234.8307   EPtot     = -35716.9070
> BOND   =    273.0706   ANGLE   =  770.8814   DIHED     =   1103.7325
> 1-4 NB =    365.0504   1-4 EEL = 5948.9635   VDWAALS   =   3906.6602
> EELEC  = -48085.2656   EHBOND  =    0.0000   RESTRAINT =      0.0000
> Ewald error estimate: 0.3295E-04
> ------------------------------------------------------------------------------
>
> NSTEP = 34000   TIME(PS) = 188.201   TEMP(K) = 292.64   PRESS = 0.0
> Etot   = -28482.3229   EKtot   = 7306.8466   EPtot     = -35789.1694
> BOND   =    298.3194   ANGLE   =  773.0516   DIHED     =   1106.2032
> 1-4 NB =    378.5118   1-4 EEL = 5980.9467   VDWAALS   =   4141.3781
> EELEC  = -48467.5802   EHBOND  =    0.0000   RESTRAINT =      0.0000
> Ewald error estimate: 0.6165E-05
> ------------------------------------------------------------------------------
>
>
> ==> nohup.out <==
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 31400 on node xxx.xxx.xx.xxx
> exited on signal 1 (Hangup).
> --------------------------------------------------------------------------
> 2 total processes killed (some possibly by mpirun during cleanup)
>
> I have even run a 200000-step simulation without a hangup, but the problem
> sometimes appears at random. I believe that whenever the error occurs, it
> is tied to me leaving the ssh terminal. I am using Fedora 9.
> Has anyone else seen this before?
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Tue Jul 14 2009 - 10:09:13 PDT