Dr. Walker, thanks for the comprehensive reply.
>
> This is most likely a problem with your hardware. Does this happen only on
> this simulation or with different simulations as well?
>
I have not tried different proteins or systems, just the one I was running on
the old computer/laptop.
>
> Are you running on a single node here or across multiple nodes? In the
> later
> case you should check your cabling, run the MPI bandwidth and latency
> tests,
> run some MPI and hardware stress tests to make sure the cables and switch
> are behaving themselves. If you are on a single node I would do some stress
> / memory tests on it to make sure that is good.
I am running on a single quad-core machine (one node, one computer), but I am
not sure how to run a stress/memory test that would load the machine the way
Amber does (I didn't see anything about this on ambermd.org).
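From what I can tell, something along these lines might serve as a generic
stress/memory test on Fedora (the 'memtester' and 'stress' package names are an
assumption on my part; memtest86+ run from the boot menu would be a more
thorough RAM check):

  su -c 'yum install memtester stress'   # assuming these packages are in the repositories
  memtester 1024M 5                      # lock and exercise 1 GB of RAM for 5 passes
  stress --cpu 4 --vm 2 --vm-bytes 256M --timeout 3600s   # load all four cores plus memory for an hour

Would something like that be a reasonable substitute for an Amber-like load?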
>
>
> Also check that you haven't got some kind of shell timeout set for your
> machine, often over zealous admins set things like 1 hour limits on
> interactive commands. Are you running this in a queueing system? In which
> case are you requesting enough wallclock time?
I am not using a queueing system but logging in directly through ssh.
In my profile I set 'export TMOUT=0'.
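In case the ssh connection itself is being dropped (by a router/NAT idle
timeout rather than by the shell), I could also add client-side keepalives; a
sketch of ~/.ssh/config on the machine I log in from (the interval values are
arbitrary):

  Host *
      ServerAliveInterval 60    # send a keepalive probe every 60 seconds
      ServerAliveCountMax 5     # drop the connection only after 5 unanswered probes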
>
>
> Is your stack set to unlimited in your shell?
Running 'ulimit -s' returned '10240', so I typed 'ulimit -s unlimited' in the
terminal and added that setting to my profile.
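For reference, this is roughly what I put in place (a sketch, assuming bash is
the login shell; 'guest' is the user name that shows up in top below):

  # ~/.bash_profile
  ulimit -s unlimited

  # /etc/security/limits.conf (needs root), in case the hard limit also has to be raised
  guest  soft  stack  unlimited
  guest  hard  stack  unlimited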
>
> Does your machine have enough memory?
2 GB was enough for this system before; however, 'top' shows:
top - 11:15:04 up 3 days, 19:15, 3 users, load average: 4.00, 4.00, 4.00
Tasks: 176 total, 5 running, 163 sleeping, 8 stopped, 0 zombie
Cpu(s): 99.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 0.0%si, 0.0%st
Mem: 2051512k total, 2035252k used, 16260k free, 166276k buffers
Swap: 2031608k total, 4k used, 2031604k free, 1141988k cached

  PID USER   PR NI VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 5249 guest  20  0 132m  39m 3424 R 100.1  2.0 262:47.25 sander
 5250 guest  20  0 132m  39m 3424 R  99.8  2.0 262:47.47 sander
 5251 guest  20  0 132m  39m 3424 R  99.8  2.0 263:07.83 sander
 5252 guest  20  0 132m  39m 3424 R  99.8  2.0 262:51.78 sander
So maybe I don't have enough memory (since only 16260k is free). I always
looked at '%MEM' and was happy that only 2% was used at any time. Or maybe I
don't understand what is going on here.
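Working through the numbers, though, most of that 'used' memory appears to be
the kernel's buffer/page cache, which is handed back to applications on demand
(assuming I am reading top correctly):

  used - buffers - cached = 2035252k - 166276k - 1141988k  ~ 710 MB actually held by processes
  free + buffers + cached = 16260k + 166276k + 1141988k    ~ 1.3 GB effectively available

('free -m' shows the same breakdown on its '-/+ buffers/cache' line.) With each
sander process at about 39 MB resident, the run does not look memory-bound.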
>
> Are other things running on it at the same time?
No
>
> Are you running the command with 'nohup' so it isn't killed when you close
> the terminal?
Yes, my commands are of the form 'nohup ... &'.
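The full form is something like this (the executable name follows what top
shows and the file names are just placeholders), redirecting stdin/stdout/stderr
and detaching the job so bash has no reason to forward a SIGHUP; running the
whole thing inside 'screen' would sidestep the hang-up signal entirely:

  nohup mpirun -np 4 sander -O -i md.in -o md.out -p prmtop -c inpcrd -r md.rst \
        < /dev/null > run.log 2>&1 &
  disown -h    # keep bash from sending SIGHUP to this job when the shell exits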
>
>
> Is some 'nice' person just trying to mess with you by logging in as root
> and
> randomly killing your jobs?
That would be mean
>
>
> Short answer is that it is most likely a hardware / user environment issue
> than an AMBER issue. Especially if it happens on multiple different
> simulations.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: amber-bounces.ambermd.org [mailto:amber-bounces.ambermd.org] On Behalf Of Naser Alijabbari
> > Sent: Tuesday, July 14, 2009 3:49 AM
> > To: AMBER Mailing List
> > Subject: [AMBER] Re: mpirun noticed that process rank 1 ... on signal 1 (Hangup).
> > Sorry, the message was cut. However, when I use the same configuration on a
> > new computer (Systemax model 981091, Intel quad-core), I get the following
> > error at non-reproducible intervals:
> > NSTEP = 32000   TIME(PS) = 186.201   TEMP(K) = 289.61   PRESS = 0.0
> > Etot   =   -28480.7941  EKtot   =     7231.1811  EPtot      =   -35711.9752
> > BOND   =      286.8982  ANGLE   =      774.4197  DIHED      =     1115.8723
> > 1-4 NB =      373.9270  1-4 EEL =     5943.6287  VDWAALS    =     3888.4780
> > EELEC  =   -48095.1992  EHBOND  =        0.0000  RESTRAINT  =        0.0000
> > Ewald error estimate:   0.3488E-03
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 33000   TIME(PS) = 187.201   TEMP(K) = 289.75   PRESS = 0.0
> > Etot   =   -28482.0763  EKtot   =     7234.8307  EPtot      =   -35716.9070
> > BOND   =      273.0706  ANGLE   =      770.8814  DIHED      =     1103.7325
> > 1-4 NB =      365.0504  1-4 EEL =     5948.9635  VDWAALS    =     3906.6602
> > EELEC  =   -48085.2656  EHBOND  =        0.0000  RESTRAINT  =        0.0000
> > Ewald error estimate:   0.3295E-04
> > ------------------------------------------------------------------------------
> >
> > NSTEP = 34000   TIME(PS) = 188.201   TEMP(K) = 292.64   PRESS = 0.0
> > Etot   =   -28482.3229  EKtot   =     7306.8466  EPtot      =   -35789.1694
> > BOND   =      298.3194  ANGLE   =      773.0516  DIHED      =     1106.2032
> > 1-4 NB =      378.5118  1-4 EEL =     5980.9467  VDWAALS    =     4141.3781
> > EELEC  =   -48467.5802  EHBOND  =        0.0000  RESTRAINT  =        0.0000
> > Ewald error estimate:   0.6165E-05
> > ------------------------------------------------------------------------------
> >
> >
> > ==> nohup.out <==
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 1 with PID 31400 on node xxx.xxx.xx.xxx
> > exited on signal 1 (Hangup).
> > --------------------------------------------------------------------------
> > 2 total processes killed (some possibly by mpirun during cleanup)
> >
> > I have even run a simulation of 200,000 steps without a hangup, but the
> > problem sometimes appears at random. I believe it is tied to my leaving the
> > ssh terminal whenever the error does occur. I am using Fedora 9.
> > Has anyone else seen this before?
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber