Re: [AMBER] AMBER11, pmemd.cuda: the system crashed.

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 14 Dec 2011 14:06:52 -0800

Hi Chinh,

This works fine on my system with an up-to-date, patched version of AMBER 11.
Judging by the output you sent me, you are running an unpatched version of
AMBER 11. Your output has:

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)
|
| CAUTION: The CUDA code is currently experimental.
| You use it at your own risk. Be sure to
| check ALL results carefully.
|
| Precision model in use:
| [SPDP] - Hybrid Single/Double Precision (Default).
|
|--------------------------------------------------------

While it should have:

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
| Version 2.2
|
| 08/16/2011
|
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)
|
| CAUTION: The CUDA code is currently experimental.
| You use it at your own risk. Be sure to
| check ALL results carefully.
|
| Precision model in use:
| [SPDP] - Hybrid Single/Double Precision (Default).
|
|--------------------------------------------------------

Note the version number. You are running an out-of-date AMBER 11, and this
is almost certainly the cause of your issues. Start from a completely clean
Amber 11 directory created by untarring the original Amber11.tar.bz2 file
and patch it with the AMBER bugfixes; see http://ambermd.org/bugfixes11.html.
Make sure you also get AmberTools 1.5 and patch that as well.
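
If it helps, the banner comparison above can be automated. The following is
only a rough sketch in Python and is not part of AMBER: the function name
cuda_banner_version is made up for illustration, and the banner layout it
assumes is simply the one shown in the two excerpts above (other builds may
print it differently).

```python
import re


def cuda_banner_version(mdout_text):
    """Return the version string from the pmemd.cuda INFORMATION banner.

    Returns None when the banner has no "Version" line, which in the
    excerpts above corresponds to the unpatched build. The banner layout
    is an assumption taken from those excerpts only.
    """
    in_banner = False
    for line in mdout_text.splitlines():
        stripped = line.strip()
        # The banner opens with a '|---- INFORMATION ----' rule.
        if stripped.startswith("|") and "INFORMATION" in stripped:
            in_banner = True
            continue
        if in_banner:
            # The banner closes with a '|' followed only by dashes.
            if re.fullmatch(r"\|-+", stripped):
                break
            # A dedicated line such as '| Version 2.2' marks a patched build.
            match = re.match(r"\|\s*Version\s+(\S+)", stripped)
            if match:
                return match.group(1)
    return None
```

Calling cuda_banner_version(open("mdout").read()) on an output file (the
file name here is just a placeholder) should give None for a banner like
yours and a version string for a banner like mine.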

I would also note that:

#heating in 6ns at 500K without restraint on the model to unfold the protein
 &cntrl
  imin=0,
  iwrap=0,
  irest=0,
  ntx=1,
  ntb=1,
  cut=10.0,
  ntr=0,
  ntc=2,
  ntf=2,
  tempi=500.0,
  temp0=500.0,
  ntt=3,
  gamma_ln=1.0,
  nstlim=3000000, dt=0.002,
  ntpr=1500, ntwx=1500,ntwr=1000
 /

You are running at 500 K but with a 2 fs time step (dt=0.002). You probably
need to reduce the time step to about 1.5 fs (dt=0.0015) to run at such an
elevated temperature.
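
Keep in mind that a shorter time step means more steps for the same 6 ns.
The arithmetic is sketched below in Python; the helper names are made up for
illustration, and the only AMBER-specific assumption is that dt in the
&cntrl namelist is given in picoseconds.

```python
def steps_for_duration(total_ns, dt_fs):
    """Number of MD steps needed to cover total_ns at a time step of dt_fs."""
    total_fs = total_ns * 1_000_000  # 1 ns = 1,000,000 fs
    return round(total_fs / dt_fs)


def dt_in_ps(dt_fs):
    """Convert a time step in femtoseconds to picoseconds (the unit of dt)."""
    return dt_fs / 1000.0


# Compare the original 2 fs step with the suggested 1.5 fs step for 6 ns.
for dt_fs in (2.0, 1.5):
    nstlim = steps_for_duration(6, dt_fs)
    print(f"dt={dt_in_ps(dt_fs):.4f} ps -> nstlim={nstlim}")
```

So at 1.5 fs you would need nstlim=4000000 rather than 3000000 to cover the
same 6 ns, and ntpr/ntwx would have to be rescaled in the same way if you
want output at the same time intervals as before.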

All the best
Ross


> -----Original Message-----
> From: Chinh Su Tran To [mailto:chinh.sutranto.gmail.com]
> Sent: Tuesday, December 13, 2011 8:04 PM
> To: AMBER Mailing List
> Subject: Re: [AMBER] AMBER11, pmemd.cuda: the system crashed.
>
> Dear Dr. Walker,
>
> I sent you the files and the info to your gmail. Thank you.
>
> Chinh
>
> On Tue, Dec 13, 2011 at 12:09 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Hi Chinh,
> >
> > Can you send me (offlist) all of your input files please along with
> details
> > of your computer system. OS, NVIDIA compiler and driver version,
> hardware
> > spec (especially the GPU version). I need to be able to replicate
> this in
> > order to investigate what is going wrong.
> >
> > Please also include the output from the run that gave what looked
> like a
> > disk error.
> >
> > Thank you.
> >
> > All the best
> > Ross
> >
> > > -----Original Message-----
> > > From: Chinh Su Tran To [mailto:chinh.sutranto.gmail.com]
> > > Sent: Monday, December 12, 2011 8:04 PM
> > > To: AMBER Mailing List
> > > Subject: Re: [AMBER] AMBER11, pmemd.cuda: the system crashed.
> > >
> > > Dear Dr. Walker,
> > >
> > > As you suggested, we ran the check for the hard disk, but both were
> > > clean!
> > > I tried to run the same code using pmemd only, and it was fine,
> i.e. no
> > > crash, no error.
> > >
> > > But then I returned using pmemd.cuda, it happened again (crashed).
> > >
> > > There were also 2 problems that I noticed when I was using
> pmemd.cuda:
> > >
> > > 1. When I used iwrap=0 (in the input file as below), it showed
> > > "segmentation fault" immediately. I knew that it was an old error I
> > > had encountered (but I wanted to try it to check pmemd.cuda only).
> > > 2. Then I switched to iwrap=1 with some modification in gpu.cpp (the
> > > solution that I found in the AMBER forum), and it crashed. (However,
> > > it also crashed before these modifications.)
> > >
> > > Please help. We did not know what was wrong.
> > >
> > > The input is:
> > >
> > > &cntrl
> > > imin=0,
> > > iwrap=1 => I also tried iwrap=0
> > > irest=0,
> > > ntx=1,
> > > ntb=1,
> > > cut=10.0,
> > > ntr=0,
> > > ntc=2,
> > > ntf=2,
> > > tempi=500.0,
> > > temp0=500.0,
> > > ntt=3,
> > > gamma_ln=1.0,
> > > nstlim=3000000, dt=0.002,
> > > ntpr=1500, ntwx=1500,ntwr=1000
> > > /
> > >
> > >
> > > Thank you.
> > > Chinsu
> > >
> > >
> > >
> > >
> > > On Tue, Dec 6, 2011 at 1:01 PM, Ross Walker <ross.rosswalker.co.uk>
> > > wrote:
> > >
> > > > Hi Chinsu,
> > > >
> > > > This looks like a hard drive failure to me (or pending hard drive
> > > failure).
> > > > Please try things with the CPU version of the code and see what
> > > happens. I
> > > > can't see how this could be generated by the GPU code. You might
> want
> > > to
> > > > try
> > > > booting the machine in single user (or recovery mode) and running
> an
> > > fsck
> > > > on
> > > > the file system. You could also try running a smartctl check on
> the
> > > hard
> > > > drive to see what its diagnostics are reporting.
> > > >
> > > > All the best
> > > > Ross
> > > >
> > > > > -----Original Message-----
> > > > > From: Chinh Su Tran To [mailto:chinh.sutranto.gmail.com]
> > > > > Sent: Monday, December 05, 2011 7:33 PM
> > > > > To: AMBER Mailing List
> > > > > Subject: [AMBER] AMBER11, pmemd.cuda: the system crashed.
> > > > >
> > > > > Dear AMBER users,
> > > > >
> > > > > I was running pmemd.cuda using Amber11 on a GPU which is
> installed
> > > in a
> > > > > workstation.
> > > > > The process was of 2 steps of short minimizations and a 6ns of
> > > heating
> > > > > the
> > > > > protein (270 residues) at 500K.
> > > > >
> > > > > > The minimizations were fine, but when I ran the heating, my system
> > > > > > "crashed". The errors are as below:
> > > > >
> > > > > *[2932683.628873] EXT4-fs (sda1): previous I/O error to
> superblock
> > > > > detected*
> > > > > *
> > > > > [2932683.646789] EXT4-fs (device sda1): ext4_find_entry:933:
> > > > > inode#9699494:
> > > > > comm init: reading directory lblock 0**
> > > > > *
> > > > > *
> > > > > *
> > > > > I re-booted the system, and tried to run it again. The same
> errors
> > > > > came.
> > > > > My input file is:
> > > > >
> > > > > &cntrl
> > > > > imin=0,
> > > > > > iwrap=1 => I also tried iwrap=0
> > > > > irest=0,
> > > > > ntx=1,
> > > > > ntb=1,
> > > > > cut=10.0,
> > > > > ntr=0,
> > > > > ntc=2,
> > > > > ntf=2,
> > > > > tempi=500.0,
> > > > > temp0=500.0,
> > > > > ntt=3,
> > > > > gamma_ln=1.0,
> > > > > nstlim=3000000, dt=0.002,
> > > > > ntpr=1500, ntwx=1500,ntwr=1000
> > > > > /
> > > > > *
> > > > > *
> > > > > *
> > > > > *
> > > > > > We tried to find out what was going on, but we could not work out
> > > > > > where the crash came from.
> > > > > Please help.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Chinsu
> > > > > _______________________________________________
> > > > > AMBER mailing list
> > > > > AMBER.ambermd.org
> > > > > http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Dec 14 2011 - 14:30:02 PST