[AMBER] Problematic structure after minimization on GPU

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Mon, 26 Aug 2013 16:25:00 +0200

Hello,

I have a test case for you. It is reproducibly failing on GTX 580, GTX
690, Tesla C2070 using pmemd.cuda (version 12.3.1, 08/07/2013).

The system in question is a rather small system. After going through two
minimizations, it fails within the first steps of heatup with

Error: unspecified launch failure launching kernel kNLSkinTest

The problem seems to be in the output structure of the second
minimization. When starting heatup from there using the CPU version of
pmemd (and same input otherwise), this also fails within a few steps.
After the first step, pmemd says in the mdout file:

vlimit exceeded for step 0; vmax = 28405.4406


After the third step the simulation crashes:

vlimit exceeded for step 3; vmax = 64.6955

       Coordinate resetting cannot be accomplished,
       deviation is too large
       iter_cnt, my_bond_idx, i and j are : 2 948 435 434
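
To compare runs quickly, the vlimit warnings can be pulled out of an mdout file with a small shell helper. This is only a sketch: the function name vlimit_report is made up for illustration, and it assumes the warning lines look exactly like the two quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of repro.sh): list every vlimit warning in an
# mdout file as "step vmax" pairs. Assumes the message format
#   vlimit exceeded for step N; vmax = X
vlimit_report() {
    local mdout="$1"
    # -n plus the trailing "p" prints only the lines where the pattern matched.
    sed -n 's/^ *vlimit exceeded for step *\([0-9][0-9]*\); *vmax *= *\([0-9.][0-9.]*\).*$/\1 \2/p' "${mdout}"
}
```

Calling it as "vlimit_report heatup_NVT.out" prints one line per warning, which makes it easy to see at which step the velocities first go wrong.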

Running the entire protocol (min1, min2, heatup) with the CPU version, I
don't observe the problem at all, probably because the minimization
takes a different 'path'.

The problematic system seems to hit an *extremely* special and therefore
unlikely coordinate constellation. Let me explain why I believe this is
so rare:

In my current study I perform independent simulations of many systems
comprised of the same receptor protein and a relatively small ligand
molecule, placed distal from the receptor in the (explicit) solvent.
Initially, all systems have equivalent receptor coordinates. The ligand
molecule is the same in all systems. The internal configuration of the
ligand is equivalent in all systems. The placement of the ligand's
center of mass is equivalent in all systems. The systems only differ in
the rotational state of the ligand around its COM. All of these systems
evolve fine during minimization, heatup, equilibration and production.
Except for the one that reproducibly fails during heatup. I can make it
not fail during heatup by reducing maxcyc from 1000 to 700 in the
first minimization -- so this really seems to be an unfortunate and
unlikely combination of conditions. And if it weren't for the awesome
simulation reproducibility of recent Amber GPU code, I probably would
not have observed this more than once.
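
To probe how sensitive the failure is to the length of the first minimization, the min1 input can be generated with maxcyc as a parameter. The following is a sketch, not part of repro.sh: the function name write_min1_input is hypothetical, and all settings other than maxcyc are copied from the min1 input shown in the script at the end of this mail.

```shell
#!/bin/bash
# Hypothetical helper: write the first-minimization input with a chosen maxcyc.
# Usage: write_min1_input MAXCYC OUTFILE
# Everything except maxcyc mirrors the min1 input in repro.sh, including the
# leading blank line that the original echo produces.
write_min1_input() {
    local maxcyc="$1"
    local outfile="$2"
    cat > "${outfile}" <<EOF

&cntrl
  imin = 1,
  maxcyc = ${maxcyc},
  ncyc = 500,
  ntb = 1,
  ntr = 1,
  cut = 8.0,
  ig = -1,
  ntxo = 2,
  restraint_wt = 500.0,
  restraintmask = "!:WAT",
/
EOF
}
```

For example, "write_min1_input 700 min1.in" produces the variant that does not fail here, and "write_min1_input 1000 min1.in" the one that does.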

Regarding the problematic system, the starting structure for heatup (the
last restart file of the second minimization), visualized in VMD, looks
fine: the ligand is still faaar away from the protein, beautiful water
molecules as placed by leap (and already slightly wiggled) are present.
I could not find any clashes in that structure (automated search with
MOE -- is there a way to automate this in VMD?), so to me there is no
obvious problem with that file.

Visualizing the heatup trajectory recorded with ntwx=1 just shows that
the system suddenly explodes, in frame 20 or so.
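
Since the heatup writes energies every step (ntpr=1), the point of the explosion can also be located from the mdout file without a viewer. The helper below is a rough sketch under assumptions: the name first_hot_step is invented here, and it assumes the usual mdout energy line layout "NSTEP = ... TIME(PS) = ... TEMP(K) = ... PRESS = ...". Values that overflow their field width (printed as asterisks) are not detected.

```shell
#!/bin/bash
# Hypothetical helper: print the first NSTEP whose TEMP(K) exceeds a threshold.
# Usage: first_hot_step MDOUT THRESHOLD_K
# Assumes mdout energy lines of the form
#   NSTEP =  1  TIME(PS) =  0.002  TEMP(K) =  200.00  PRESS =  0.0
first_hot_step() {
    local mdout="$1"
    local tmax="$2"
    awk -v tmax="${tmax}" '
        $1 == "NSTEP" {
            # Look for the TEMP(K) label; its value is two fields later.
            for (i = 1; i <= NF - 2; i++) {
                if ($i == "TEMP(K)" && ($(i + 2) + 0) > tmax) {
                    print $3   # the step number follows "NSTEP ="
                    exit
                }
            }
        }' "${mdout}"
}
```

With temp0 = 300, a call such as "first_hot_step heatup_NVT.out 1000" reports the first step at which the temperature is clearly unphysical; the threshold of 1000 K is an arbitrary choice.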

I think it is also worth pointing out that

- I have used the same heatup input settings for a long time now,
applied to various systems. They have reliably worked so far.

- the heatup reproducibly fails on GPU and CPU (when starting from
min2.rst as created by GPU code) with 'ig = -1', so this does not depend
on any specific random number sequence in the heatup simulation.

- the problem in min2.rst does not depend on the storage format (I tried
ASCII and NetCDF).


Obviously I can simply work around this problem. However, I found it
important to share with you, because

- the error message in the GPU version can be improved. The CPU version
informs about crazy velocities. The GPU version just says 'launch
failure launching kernel kNLSkinTest'.

- it would be very interesting to understand what is wrong with the
structure in min2.rst as created by the GPU version. Hopefully someone
can find this out and clarify.

- there might be a problem in the GPU minimization code that 'creates'
this problematic structure.

For testing purposes, I have created an archive for you:

http://gehrcke.de/files/perm/amber130826/heatup-fail-repro.tar.gz (~ 700 kB)

It contains the initial coordinate file and the parameter topology file
of the system (created by leap). It also contains the shell script
'repro.sh' that triggers the problem. For the mailing list archives, I
also include the content of the script in this mail, below in plain text.


Cheers,

Jan-Philip


repro.sh:

#!/bin/bash

#ENGINE="mpirun -np 16 pmemd.MPI"
ENGINE="pmemd.cuda"

err() {
     # Print error message to stderr.
     echo "$@" 1>&2;
     }

print_run_command () {
     echo "Running command:"
     echo "${1}"
     eval "${1}"
     }


PRMTOP="top.prmtop"
INITCRD="initcoords.crd"
MIN1PREFIX="min1"
MIN2PREFIX="min2"
MIN1FILE="${MIN1PREFIX}.in"
MIN2FILE="${MIN2PREFIX}.in"
HEATPREFIX="heatup_NVT"
HEATINFILE="${HEATPREFIX}.in"
EQUIPREFIX="equilibrate_NPT"
EQUIINFILE="${EQUIPREFIX}.in"


echo "Writing minimization input file ${MIN1FILE}."
echo "
&cntrl
  imin = 1,
  maxcyc = 1000,
  ncyc = 500,
  ntb = 1,
  ntr = 1,
  cut = 8.0,
  ig = -1,
  ntxo = 2,
  restraint_wt = 500.0,
  restraintmask = \"!:WAT\",
/
" > ${MIN1FILE}


echo "Writing minimization input file ${MIN2FILE}."
echo "
&cntrl
  imin = 1,
  maxcyc = 1000,
  ncyc = 500,
  ntb = 1,
  ntr = 0,
  cut = 8.0,
/
" > ${MIN2FILE}


echo "Running first minimization (fixed solute)..."
CMD="time ${ENGINE} -O -i ${MIN1FILE} -o ${MIN1PREFIX}.out -p ${PRMTOP} \
      -c ${INITCRD} -r ${MIN1PREFIX}.rst -ref ${INITCRD}"
print_run_command "${CMD}"
if [ $? != 0 ]; then
     err "Error during first minimization. Exit."
     exit 1
fi

echo "Running second minimization (entire system is flexible)..."
CMD="time ${ENGINE} -O -i ${MIN2FILE} -o ${MIN2PREFIX}.out -p ${PRMTOP} \
      -c ${MIN1PREFIX}.rst -r ${MIN2PREFIX}.rst -ref ${INITCRD}"
print_run_command "${CMD}"
if [ $? != 0 ]; then
     err "Error during second minimization. Exit."
     exit 1
fi

echo "Writing input file ${HEATINFILE}."
echo "
&cntrl
  ntx = 1,
  ntb = 1,
  cut = 8.0,
  ntr = 1,
  ntc = 2,
  ntf = 2,
  tempi = 200,
  temp0 = 300,
  ntt = 3,
  gamma_ln = 1.0,
  nstlim = 200,
  dt = 0.002,
  ntpr = 1,
  ntwx = 1,
  ntwr = 100,
  ioutfm = 1,
  ntxo = 2,
  ig = -1,
  restraint_wt = 5.0,
  restraintmask = \"!:WAT\",
/
" > ${HEATINFILE}

echo "Running heatup..."
CMD="time ${ENGINE} -O -i ${HEATINFILE} -o ${HEATPREFIX}.out -p ${PRMTOP} \
      -c ${MIN2PREFIX}.rst -r ${HEATPREFIX}.rst -x ${HEATPREFIX}.mdcrd \
      -ref ${MIN2PREFIX}.rst"
print_run_command "${CMD}"
if [ $? != 0 ]; then
     err "Error during heatup. Exit."
     exit 1
fi



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Aug 26 2013 - 07:30:03 PDT