Dear Dac,
following your suggestions, I tried to run my simulation with far fewer cores and processors and to increase them little by little, but a new problem occurs: the HPC facility writes weird output files (here I report my last try on 64 nodes, but the problem is the same even on 2) in which the same step is repeated more than once, which makes it impossible to finish any job. Here I report the error message; some other files are attached:
--------------------------------------------------------------------------------
*** enxds6 Job: 3115546.sdb started: 25/08/15 19:09:02 host: mom4 ***
*** enxds6 Job: 3115546.sdb started: 25/08/15 19:09:02 host: mom4 ***
*** enxds6 Job: 3115546.sdb started: 25/08/15 19:09:02 host: mom4 ***
*** enxds6 Job: 3115546.sdb started: 25/08/15 19:09:02 host: mom4 ***
--------------------------------------------------------------------------------
ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for 'null'
ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for 'null'
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
At line 754 of file runfiles.F90 (unit = 7, file = 'mdinfo')
Fortran runtime error: Input/output error
=>> PBS: job killed: walltime 172838 exceeded limit 172800
aprun: Apid 17042225: Caught signal Terminated, sending to application
Application 17042225 exit codes: 2
Application 17042225 exit signals: Terminated
Application 17042225 resources: utime ~9643965s, stime ~139190s, Rss ~231356, inblocks ~354931984, outblocks ~465159151
--------------------------------------------------------------------------------
Resources requested: ncpus=768,place=free,walltime=48:00:00
Resources allocated: cpupercent=1,cput=00:00:04,mem=6496kb,ncpus=768,vmem=141572kb,walltime=48:00:38
*** enxds6 Job: 3115546.sdb ended: 27/08/15 19:09:51 queue: long ***
*** enxds6 Job: 3115546.sdb ended: 27/08/15 19:09:51 queue: long ***
*** enxds6 Job: 3115546.sdb ended: 27/08/15 19:09:51 queue: long ***
*** enxds6 Job: 3115546.sdb ended: 27/08/15 19:09:51 queue: long ***
--------------------------------------------------------------------------------
I tried to use the checkValidity command on the .prmtop file, but it is taking very long. Before concluding that this file is the problem (which would be strange, because it worked correctly on my machine), I'd like you to take a look at the rest.
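For reference, a non-interactive way of running it looks roughly like this (a sketch, not exactly what I ran; it assumes parmed accepts the topology via -p and reads its actions from standard input, as its interactive interpreter does):

# Run parmed's checkValidity on the topology without an interactive session.
parmed -p SF4.prmtop <<'EOF'
checkValidity
quit
EOF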
Many thanks
Damiano
________________________________________
From: David A Case [david.case.rutgers.edu]
Sent: 18 August 2015 15:12
To: AMBER Mailing List
Subject: Re: [AMBER] hpc error
On Tue, Aug 18, 2015, Damiano Spadoni wrote:
>
> I am trying to run the following script, involving a production step of
> my simulation, on an hpc facility:
>
> #!/bin/bash --login
>
> #PBS -N S4F4_md
> #PBS -l walltime=24:00:00
> #PBS -l select=4
> #PBS -j oe
> #PBS -A e280-Croft
>
> module load amber
>
> cd /work/e280/e280/enxds6/SF4
>
> aprun -n 96 sander.MPI -O -i S4F4_md.in -o S4F4_md1.out -p SF4.prmtop -c
> S4F4_heat.rst -r S4F4_md1.rst -x S4F4_md1.mdcrd
> aprun -n 96 sander.MPI -O -i S4F4_md.in -o S4F4_md2.out -p SF4.prmtop -c
> S4F4_md1.rst -r S4F4_md2.rst -x S4F4_md2.mdcrd
....
Oh my goodness....
(1) When things are not working, start simple (a single sander run, not 6
of them in sequence).
(2) sander will almost certainly perform very badly (if at all)
on 96 processors. Try running on 2 or 4 to start. Then you can test the
timing of larger core counts; a simplified test script is sketched below.
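For example, something along these lines (a sketch only; the node count, walltime, and output file names are placeholders based on your script above, so adjust them for your system):

#!/bin/bash --login
#PBS -N S4F4_test
#PBS -l walltime=01:00:00
#PBS -l select=1
#PBS -j oe
#PBS -A e280-Croft

# Same module and working directory as the original script, but a single
# short sander.MPI run on only 4 MPI ranks.
module load amber

cd /work/e280/e280/enxds6/SF4

aprun -n 4 sander.MPI -O -i S4F4_md.in -o S4F4_test.out -p SF4.prmtop \
      -c S4F4_heat.rst -r S4F4_test.rst -x S4F4_test.mdcrd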
> partition error in shake on processor 2
> this processor has atoms 13055 through 19568
> atom 19568 is within this range
> atom 19569 is not within this range !
You should look at the output file from the part of the run that failed.
Is it really running on 96 processors? How many atoms are in your system?
(Message above suggests that you have some 13000 atoms per processor....)
If you have a million atoms or more, try running your script on a much smaller
system, then gradually build up. Consider whether or not you can use pmemd
instead of sander.
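The atom count is quick to check: NATOM is the first value in the %FLAG POINTERS section of the prmtop, so a one-liner can read it (a sketch, assuming standard prmtop formatting):

# Print NATOM, the first integer on the first data line after %FLAG POINTERS.
grep -A2 '%FLAG POINTERS' SF4.prmtop | tail -1 | awk '{print $1}'

And if everything in S4F4_md.in is supported by pmemd, switching is just a change of executable, since pmemd.MPI takes the same command-line flags as sander.MPI (again a sketch; check your input options against the pmemd section of the manual first):

aprun -n 4 pmemd.MPI -O -i S4F4_md.in -o S4F4_md1.out -p SF4.prmtop \
      -c S4F4_heat.rst -r S4F4_md1.rst -x S4F4_md1.mdcrd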
> It is the first time I am trying to run this simulation on a cluster; I
> previously ran it (just the first sander.MPI command) on my machine and
> it worked, but I want to repeat this simulation on a cluster.
> Any suggestions about something I'm probably missing?
Was your local run also with the same number of processors? The "partition
error" means that the system is trying to SHAKE (constrain) the bond between
atoms 19568 and 19569. Check these atoms: is atom 19569 a hydrogen that is
bonded to atom 19568? Are those two atoms in different residues?
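One quick way to inspect those two atoms is parmed's printDetails action with an atom mask (a sketch, assuming the SF4.prmtop from your script and that parmed reads actions from standard input):

# Print atom name, residue, type, charge, etc. for atoms 19568-19569.
parmed -p SF4.prmtop <<'EOF'
printDetails @19568-19569
quit
EOF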
It is possible that partition errors can arise, depending on how complex your
system is. Since this depends on the number of processors you are using, it
could well show up on a cluster but not on a "local machine" (which, I'm
guessing, has fewer than 96 cores.)
Using the "checkValidity" command in parmed may help you localize the problem.
...good luck...dac
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
- application/octet-stream attachment: output
- application/octet-stream attachment: RSTa.dist
Received on Fri Aug 28 2015 - 05:00:03 PDT