[AMBER] Intermittent error during T-REMD on 8 GPU computer

From: Milo Westler <milo.nmrfam.wisc.edu>
Date: Fri, 2 May 2014 09:33:48 -0500

This has occurred 3 or 4 times during the T-REMD runs that I am
performing: the energies and temperatures suddenly jump to very large
values, which aborts the run. I am running pmemd.cuda.MPI from Amber 12 on
an Exxact system with 8 GTX 780 GPUs. The GPUs passed the
GPU_Validation_Test provided with the computer, and the error shows up on
different replicas (see additional examples at the bottom). Is this my
error or a bug?

MDin file (1 of 8 temperatures spread from 300-372):
  imin=0, ntx=1,
  nstlim=500, dt=0.002,
  irest=0, ntt=3, gamma_ln=1.0,
  tempi=346, temp0=346,
  ntpr=100, ntwx=1000, ntwr=100000,
  ntb=0, igb=7,
  numexchg=20001, ioutfm=1

groupfile (remd.groupfile_16):
-O -rem 1 -remlog rem16.log -i remd300.in -o remd300_16.out -c
remd300_15.rst -r remd300_16.rst -x remd300_16.nc -inf remd300_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd311.in -o remd311_16.out -c
remd311_15.rst -r remd311_16.rst -x remd311_16.nc -inf remd311_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd322.in -o remd322_16.out -c
remd322_15.rst -r remd322_16.rst -x remd322_16.nc -inf remd322_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd334.in -o remd334_16.out -c
remd334_15.rst -r remd334_16.rst -x remd334_16.nc -inf remd334_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd346.in -o remd346_16.out -c
remd346_15.rst -r remd346_16.rst -x remd346_16.nc -inf remd346_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd359.in -o remd359_16.out -c
remd359_15.rst -r remd359_16.rst -x remd359_16.nc -inf remd359_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd372.in -o remd372_16.out -c
remd372_15.rst -r remd372_16.rst -x remd372_16.nc -inf remd372_16.mdinfo -p
-O -rem 1 -remlog rem16.log -i remd378.in -o remd378_16.out -c
remd378_15.rst -r remd378_16.rst -x remd378_16.nc -inf remd378_16.mdinfo -p

mpirun -np 8 pmemd.cuda.MPI -ng 8 -groupfile &

mdout file:

 NSTEP = 3992200 TIME(PS) = 318999.400 TEMP(K) = 315.01 PRESS =
 Etot = 348.8084 EKtot = 1807.5479 EPtot =
 BOND = 849.0991 ANGLE = 1146.7702 DIHED =
 1-4 NB = 448.2307 1-4 EEL = 6878.1710 VDWAALS =
 EELEC = -8421.3508 EGB = -3796.7529 RESTRAINT =
 TEMP0 = 322.0000 REPNUM = 2 EXCHANGE# =

 NSTEP = 3992300 TIME(PS) = 318999.600 TEMP(K) =********* PRESS =
 Etot = ************** EKtot = ************** EPtot =
 BOND = -0.0000 ANGLE = 195571.6650 DIHED =
 1-4 NB = -0.0000 1-4 EEL = -19.6864 VDWAALS =
 EELEC = 46.0865 EGB = -28295.9104 RESTRAINT =
 TEMP0 = 322.0000 REPNUM = 2 EXCHANGE# =
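
To find where each replica's output first goes bad, a small scan of the mdout file can help. This is only a sketch: it assumes the energy records look like the excerpt above, with an "NSTEP =" line opening each record and asterisks marking values that overflowed the fixed-width output field. The function name first_overflow_step is made up for illustration, and asterisks that appear in an mdout for other reasons after the first record would also trigger it.

```python
import re

# Matches the step counter that opens each energy record in the excerpt above.
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")


def first_overflow_step(lines):
    """Return the NSTEP of the first record containing '*', or None.

    Assumption: a '*' after the first NSTEP line marks a value that
    overflowed its fixed-width output field.
    """
    current_step = None
    for line in lines:
        match = NSTEP_RE.search(line)
        if match:
            current_step = int(match.group(1))
        # Only report once at least one NSTEP line has been seen.
        if "*" in line and current_step is not None:
            return current_step
    return None
```

Passing an open file works because the function only iterates over lines, for example `with open("remd322_16.out") as fh: print(first_overflow_step(fh))`.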

rem.log file:

# exchange 7985
 1 -1.00 350.26 -1086.81 359.00 359.00 0.09 -1
 2 -1.00 324.23 -1482.62 322.00 322.00 0.10 -1
 3 -1.00 332.86 -1313.69 346.00 346.00 0.04 -1
 4 -1.00 391.67 -806.15 378.00 378.00 0.00 -1
 5 -1.00 340.65 -1345.66 334.00 334.00 0.07 -1
 6 -1.00 303.53 -1678.33 300.00 300.00 0.12 -1
 7 -1.00 370.59 -1000.53 372.00 372.00 0.10 -1
 8 -1.00 311.41 -1616.47 311.00 311.00 0.14 -1
# exchange 7986
 1 -1.00 349.12 -1149.07 359.00 359.00 0.09 -1
 2 1.02********** 162161.52 322.00 334.00 0.10 -1
 3 -1.00 340.05 -1301.86 346.00 346.00 0.04 -1
 4 -1.00 385.50 -767.67 378.00 378.00 0.00 -1
 5 0.98 343.15 -1319.26 334.00 322.00 0.07 -1
 6 -1.00 308.64 -1680.36 300.00 300.00 0.12 -1
 7 -1.00 386.05 -984.44 372.00 372.00 0.10 -1
 8 -1.00 312.43 -1617.70 311.00 311.00 0.14 -1

Additional examples:
Run1: equil334_12.out


# exchange 3119
 1 -1.00 339.21 -1395.47 334.00 334.00 0.15 -1
 2 -1.00 349.61 -1274.88 346.00 346.00 0.03 -1
 3 -1.00 354.72 -1146.44 359.00 359.00 0.15 -1
 4 -1.00 379.38 -872.46 385.00 385.00 0.00 -1
 5 -1.00 371.22 -996.92 372.00 372.00 0.06 -1
 6 -1.00 319.71 -1494.90 322.00 322.00 0.02 -1
 7 -1.00 300.48 -1683.57 300.00 300.00 0.07 -1
 8 -1.00 304.50 -1683.02 311.00 311.00 0.13 -1
# exchange 3120
 1 -1.00 339.35 -1381.44 334.00 334.00 0.15 -1
 2 -1.00 347.71 -1240.52 346.00 346.00 0.03 -1
 3 -1.00 360.20 -1086.91 359.00 359.00 0.15 -1
 4 -1.00********** 159280.22 385.00 385.00 0.00 -1
 5 -1.00 372.48 -1021.74 372.00 372.00 0.06 -1
 6 -1.00 309.75 -1541.63 322.00 322.00 0.02 -1
 7 1.02 305.76 -1674.30 300.00 311.00 0.07 -1
 8 0.98 314.24 -1676.61 311.00 300.00 0.13 -1
# exchange 86485
 1 -1.00 374.25 -1054.62 357.00 357.00 0.03 -1
 2 -1.00 314.90 -1458.74 314.00 314.00 0.04 -1
 3 -1.00 381.39 -1037.09 371.00 371.00 0.02 -1
 4 -1.00 299.78 -1590.98 300.00 300.00 0.04 -1
 5 -1.00 325.04 -1372.42 329.00 329.00 0.08 -1
 6 -1.00 338.35 -1275.97 343.00 343.00 0.08 -1
 7 -1.00 396.13 -603.83 386.00 386.00 0.07 -1
 8 -1.00 409.07 -600.18 400.00 400.00 0.00 -1
# exchange 86486
 1 -1.00 367.96 -1060.63 357.00 357.00 0.03 -1
 2 -1.00 320.05 -1471.71 314.00 314.00 0.04 -1
 3 -1.00********** 165628.47 371.00 371.00 0.02 -1
 4 -1.00 289.05 -1632.90 300.00 300.00 0.04 -1
 5 -1.00 331.56 -1300.85 329.00 329.00 0.08 -1
 6 -1.00 351.93 -1277.38 343.00 343.00 0.08 -1
 7 1.02 385.08 -626.59 386.00 400.00 0.07 -1
 8 0.98 388.68 -664.82 400.00 386.00 0.00 -1

-- Milo
National Magnetic Resonance Facility at Madison
      An NIH-Supported Resource Center
W. Milo Westler, Ph.D.
NMRFAM Director
Senior Scientist
Adjunct Professor
Department of Biochemistry
University of Wisconsin-Madison
433 Babcock Drive
Rm B160D
Madison, WI USA 53706-1544
EMAIL: milo.nmrfam.wisc.edu
PHONE: (608)-263-9599
FAX: (608)-263-1722
AMBER mailing list
Received on Fri May 02 2014 - 08:00:02 PDT