Re: [AMBER] hpc error from Damiano Spadoni on 2015-09-03 (Amber Archive Sep 2015)

From: Damiano Spadoni <enxds6.nottingham.ac.uk>
Date: Thu, 3 Sep 2015 10:16:19 +0000

Dear AMBER creators,

I'm trying to run the following script to simulate a heating step of my protein:
#!/bin/bash --login

#PBS -N PLP_heating
#PBS -l walltime=24:00:00
#PBS -l select=2
#PBS -j oe
#PBS -A e280-Croft

module load amber

cd /work/e280/e280/enxds6/PLP

export MPICH_FAST_MEMCPY=1
export MPI_COL_OPT_ON=1

aprun -n 48 pmemd.MPI -O -i LAMPLP_heat1.in -o LAMPLP_heat1.out -p LAMPLP.prmtop -c LAMPLP_min2.rst -r LAMPLP_heat1.rst -ref LAMPLP.inpcrd -x LAMPLP_heat1.mdcrd
echo "DONE"

...but it stopped after 20 seconds with the text:

ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for 'null'
ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for 'null'
Rank 0 [Thu Sep 3 03:44:01 2015] [c2-1c1s4n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x8ABA6D in _gfortran_backtrace at backtrace.c:258
#1 0x8947E0 in _gfortrani_backtrace_handler at compile_options.c:129
#2 0x90C01F in raise
#3 0x90BFDB in raise at pt-raise.c:41
#4 0x91C5D0 in abort at abort.c:92
#5 0x7DED71 in MPID_Abort
#6 0x7BFBD2 in MPI_Abort
#7 0x797BB4 in pmpi_abort
#8 0x4A2156 in __pmemd_lib_mod_MOD_mexit
#9 0x4854CE in __shake_mod_MOD_shake
#10 0x47E2B7 in __runmd_mod_MOD_runmd
#11 0x4B6543 in MAIN__ at pmemd.F90:0
_pmiu_daemon(SIGCHLD): [NID 02000] [c2-1c1s4n0] [Thu Sep 3 03:44:02 2015] PE RANK 0 exit signal Aborted
[NID 02000] 2015-09-03 03:44:02 Apid 17270990: initiated application termination
Application 17270990 exit codes: 134
Application 17270990 exit signals: Killed
Application 17270990 resources: utime ~9s, stime ~3s, Rss ~63044, inblocks ~137873, outblocks ~60547
DONE
--------------------------------------------------------------------------------

The output file it produced is attached (it ran for 400 steps and then stopped).

This was supposed to be a three-step heating stage, but with three commands the error text was much the same (plus an error opening LAMPLP_heat2.rst), so I tried with just one step, and that does not work either. The same approach worked for the two-stage minimization, so I guess the problem is not in the topology and coordinate files.
What might it be?
Thanks
Damiano
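
The backtrace above shows the abort being raised from the SHAKE routine (frame #9), and pmemd usually writes its reason near the end of the mdout file before exiting. A minimal sketch for pulling that out, assuming the output file named in the job script; the function name and the search patterns are illustrative guesses, not an exhaustive list of AMBER messages:

```shell
#!/bin/bash
# Print the end of an mdout file plus any lines that look like error reports.
# show_stop_reason is a hypothetical helper; the patterns are guesses at
# typical wording and may miss the actual message.
show_stop_reason() {
    local mdout="$1"
    if [ ! -r "$mdout" ]; then
        echo "cannot read $mdout" >&2
        return 1
    fi
    echo "--- last 30 lines of $mdout ---"
    tail -n 30 "$mdout"
    echo "--- lines matching error-like patterns ---"
    grep -n -i -E 'error|cannot|vlimit|NaN|\*\*\*\*' "$mdout" || echo "(none found)"
}

# Example: show_stop_reason LAMPLP_heat1.out
```

Runs of asterisks are included in the patterns because they typically mark a value that overflowed its print field, which can be a sign that the system is blowing up.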
________________________________________
From: David A Case [david.case.rutgers.edu]
Sent: 28 August 2015 13:20
To: AMBER Mailing List
Subject: Re: [AMBER] hpc error

On Fri, Aug 28, 2015, Damiano Spadoni wrote:
>
> following your suggestions, I tried to run my simulation using far fewer
> cores and processors and increasing them little by little, but a new
> problem occurs: the HPC facility is writing weird output files (here I
> report my last try on 64 nodes, but the problem is the same even on 2)
> where the same step is repeated more than once, which makes it impossible
> to finish any job. Here I report the error message; some other files
> are attached:

Here is at least one key problem with your script:

aprun -n 768 pmemd -O -i S4F4_md_200ps.in -o S4F4_md200ps32.out -p
SF4.prmtop -c S4F4_heat.rst -r S4F4_md200ps32.rst -x S4F4_md200ps32.mdcrd

You need to be running pmemd.MPI *not* pmemd. You are essentially running the
same (serial) program on every core, and the output is being intermingled.

And, while you are trying to get things working, why not run a much shorter
job? (Set nstlim to 100 or so, with a small value of ntpr.)
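
One way to follow this advice without retyping the input is to derive a short test input from the existing one. This is only a sketch: it assumes nstlim and ntpr are set explicitly (in lower case) in the original file, the function name is hypothetical, and ntpr=10 is just one possible "small value":

```shell
#!/bin/bash
# Copy an mdin file, overriding nstlim and ntpr for a quick test run.
# make_short_input is a hypothetical helper. If either keyword is absent
# from the source file, it is NOT added; the copy is left unchanged there.
make_short_input() {
    local src="$1" dst="$2"
    sed -E \
        -e 's/nstlim[[:space:]]*=[[:space:]]*[0-9]+/nstlim=100/' \
        -e 's/ntpr[[:space:]]*=[[:space:]]*[0-9]+/ntpr=10/' \
        "$src" > "$dst"
}

# Example: make_short_input S4F4_md_200ps.in S4F4_short.in
```

The original input file is left untouched, so the full-length run can be resubmitted once the short one completes cleanly.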

Even the correct program (pmemd.MPI) will probably never scale to 768 cores.
Be sure to try a variety of values to optimize this.
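
Trying a variety of values could be scripted along these lines. The sketch only prints the aprun lines rather than running them; the function name, the scan_* file names, and the core counts in the example are placeholders, not values taken from this thread:

```shell
#!/bin/bash
# Print (do not run) one aprun line per core count for a short benchmark job.
# print_scan_commands is a hypothetical helper; output names are placeholders.
# If the input uses positional restraints, a -ref argument would also be needed.
print_scan_commands() {
    local mdin="$1" prmtop="$2" inpcrd="$3"
    shift 3
    local n
    for n in "$@"; do
        echo "aprun -n $n pmemd.MPI -O -i $mdin -o scan_${n}.out -p $prmtop -c $inpcrd -r scan_${n}.rst -x scan_${n}.mdcrd"
    done
}

# Example: print_scan_commands short.in SF4.prmtop S4F4_heat.rst 24 48 96 192
```

Each printed line can be pasted into its own job script with a matching node request; comparing the timing information reported at the end of each scan_*.out should show where adding cores stops helping.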

....dac

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

application/octet-stream attachment: LAMPLP_heat1.out

Received on Thu Sep 03 2015 - 03:30:03 PDT