[AMBER] pmemd cuda MPI nmropt crashes

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Tue, 6 Nov 2012 01:08:45 -0500

Hi,

All Amber12 floating point versions of pmemd.cuda_*.MPI are failing the nmropt tests.
This happens consistently for 1 to 8 GPUs across 1 to 4 nodes of NVIDIA Tesla M2070 GPUs,
where each node has 2 GPUs:
https://www.osc.edu/supercomputing/hardware#Oakley
The serial pmemd.cuda builds pass these tests; in fact, all other tests generally look OK.

I have not seen any other reports of similar failures.
What should be done before a bug report is filed?

thanks,
scott


--------- versions
amber12/patch_amber.py --patch-level
Latest patch applied to AmberTools12: 28
mpif90 -show
ifort -I/usr/local/mvapich2/1.7-intel/include -I/usr/local/mvapich2/1.7-intel/include -L/usr/local/mvapich2/1.7-intel/lib -lmpichf90 -lmpichf90 -lmpich -lopa -lmpl -lpthread -lhwloc -libverbs -libumad -ldl -lrt
ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.319 Build 20120410
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Thu_Apr__5_00:24:31_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221
--------- versions

--------- traces
I had to hack config.h to get a debug build.
Why doesn't pmemd respect configure's -debug or AMBERBUILDFLAGS?

Nov 05 1:25:35am 466$ /tmp/pbstmp.522586/test/cuda/nmropt/pme/angle mpiexec pmemd.cuda_DPDP.MPI.debug -O -c ../myoglobin_pbc.inpcrd -p ../myoglobin_pbc.prmtop
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine               Line  Source
pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_        1575  gpu.cpp
pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_     3415  nmr_calls.F90
pmemd.cuda_DPDP.M  0000000000432EAB  parallel_mod_mp_p      329  parallel.F90
pmemd.cuda_DPDP.M  000000000051C3CE  pme_alltasks_setu      174  pme_alltasks_setup.F90
pmemd.cuda_DPDP.M  00000000004F92F7  MAIN__                 204  pmemd.F90
pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown

Nov 05 3:11:40am 521$ /tmp/pbstmp.522586/test/cuda/nmropt/gb/angle mpiexec -n 1 pmemd.cuda_DPDP.MPI.debug -O -c -O -c ../myoglobin_gb.inpcrd -p ../myoglobin_gb.prmtop
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine               Line  Source
pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_        1575  gpu.cpp
pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_     3415  nmr_calls.F90
pmemd.cuda_DPDP.M  0000000000520272  gb_alltasks_setup      116  gb_alltasks_setup.F90
pmemd.cuda_DPDP.M  00000000004F84BA  MAIN__                 206  pmemd.F90
pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown
mpiexec: Warning: task 0 exited with status 174.

Nov 05 3:20:02am 539$ /tmp/pbstmp.522586/test/cuda/nmropt/pme/temp mpiexec -n 1 pmemd.cuda_DPDP.MPI.debug -O -p ../myoglobin_pbc.prmtop -c ../myoglobin_pbc.inpcrd -i mdin
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine               Line  Source
pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_        1575  gpu.cpp
pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_     3415  nmr_calls.F90
pmemd.cuda_DPDP.M  0000000000432EAB  parallel_mod_mp_p      329  parallel.F90
pmemd.cuda_DPDP.M  000000000051C3CE  pme_alltasks_setu      174  pme_alltasks_setup.F90
pmemd.cuda_DPDP.M  00000000004F92F7  MAIN__                 204  pmemd.F90
pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown
--------- traces
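
All three traces above die in the same place, so one quick check before filing a report is to pull the innermost frame out of each traceback and count how often each occurs. The following is only a minimal Python sketch, assuming the forrtl traceback layout shown above (a header line starting with "Image PC", then whitespace-separated columns). The helper name top_frames and the sample text are illustrative and not part of Amber.

```python
from collections import Counter


def top_frames(log_text):
    """Yield (routine, line, source) for the innermost frame of each traceback.

    Assumes each traceback has a column header beginning with "Image PC"
    and that the line right after it is the innermost frame.
    """
    expect_frame = False
    for raw in log_text.splitlines():
        fields = raw.split()
        if fields[:2] == ["Image", "PC"]:
            expect_frame = True  # the next line holds the innermost frame
            continue
        if expect_frame and len(fields) >= 5:
            # Columns: Image, PC, Routine, Line, Source
            yield fields[2], fields[3], fields[4]
        expect_frame = False


# Illustrative input: the first two frames of two of the traces above.
sample = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
pmemd.cuda_DPDP.M 000000000054B607 gpu_nmr_setup_ 1575 gpu.cpp
pmemd.cuda_DPDP.M 000000000050B916 nmr_calls_mod_mp_ 3415 nmr_calls.F90

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
pmemd.cuda_DPDP.M 000000000054B607 gpu_nmr_setup_ 1575 gpu.cpp
pmemd.cuda_DPDP.M 000000000050B916 nmr_calls_mod_mp_ 3415 nmr_calls.F90
"""

frame_counts = Counter(top_frames(sample))
for frame, count in frame_counts.items():
    print(count, *frame)
```

If every traceback reports the same routine, line, and source file, as the three above do, that single frame is probably the most useful thing to quote in a bug report.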

--------- typical test results
Output file naming: testcuda.#nodes.ppn.#gpus.id

testcuda.1.1.1.o522657
26 file comparisons passed
10 file comparisons failed
20 tests experienced errors
--
29 file comparisons passed
7 file comparisons failed
20 tests experienced errors
--
31 file comparisons passed
5 file comparisons failed
20 tests experienced errors
testcuda.1.2.2.o522658
31 file comparisons passed
5 file comparisons failed
20 tests experienced errors
--
29 file comparisons passed
7 file comparisons failed
20 tests experienced errors
--
27 file comparisons passed
0 file comparisons failed
56 tests experienced errors
testcuda.2.2.2.o506511
31 file comparisons passed
5 file comparisons failed
20 tests experienced errors
--
29 file comparisons passed
7 file comparisons failed
20 tests experienced errors
--
35 file comparisons passed
1 file comparison failed
20 tests experienced errors
testcuda.3.2.2.o508216
30 file comparisons passed
6 file comparisons failed
20 tests experienced errors
--
30 file comparisons passed
6 file comparisons failed
20 tests experienced errors
--
35 file comparisons passed
1 file comparison failed
20 tests experienced errors
--------- typical test results
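
To compare these summaries across node and GPU counts, they can be loaded into a small table. The following is a minimal Python sketch, assuming the testcuda.#nodes.ppn.#gpus.id naming and the three summary lines per group shown above. The function name parse_results and the sample text are illustrative, and the sketch does not try to interpret what each "--"-separated group within a file corresponds to.

```python
import re

# Field layout taken from the naming line above: testcuda.#nodes.ppn.#gpus.id
NAME_RE = re.compile(r"^testcuda\.(\d+)\.(\d+)\.(\d+)\.(\S+)$")
COUNT_RE = re.compile(
    r"^(\d+) (?:file comparisons? |tests experienced )(passed|failed|errors)$"
)


def parse_results(text):
    """Return one dict per summary group, tagged with its file's fields."""
    groups = []
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        m = NAME_RE.match(line)
        if m:
            nodes, ppn, gpus, job_id = m.groups()
            header = {"nodes": int(nodes), "ppn": int(ppn),
                      "gpus": int(gpus), "id": job_id}
            continue
        m = COUNT_RE.match(line)
        if not m or header is None:
            continue  # "--" separators and unrelated lines are skipped
        count, kind = int(m.group(1)), m.group(2)
        if kind == "passed":
            # A "passed" line opens a new group under the current file.
            groups.append(dict(header, passed=count, failed=0, errors=0))
        elif groups:
            groups[-1][kind] = count
    return groups


# Illustrative input: two of the groups listed above.
sample = """\
testcuda.1.2.2.o522658
31 file comparisons passed
5 file comparisons failed
20 tests experienced errors
--
27 file comparisons passed
0 file comparisons failed
56 tests experienced errors
"""

results = parse_results(sample)
for group in results:
    print(group)
```

Sorting the resulting dicts by their "errors" value makes outliers easy to spot, such as the group with 56 errors in testcuda.1.2.2.o522658.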
---------  mdout.angle 
          -------------------------------------------------------
          Amber 12 SANDER                              2012
          -------------------------------------------------------
| PMEMD implementation of SANDER, Release 12
| Run on 11/05/2012 at 03:06:38
  [-O]verwriting output
File Assignments:
|   MDIN: mdin                                                                  
|  MDOUT: mdout                                                                 
| INPCRD: ../myoglobin_pbc.inpcrd                                               
|   PARM: ../myoglobin_pbc.prmtop                                               
| RESTRT: restrt                                                                
|   REFC: refc                                                                  
|  MDVEL: mdvel                                                                 
|   MDEN: mden                                                                  
|  MDCRD: mdcrd                                                                 
| MDINFO: mdinfo                                                                
|LOGFILE: logfile                                                               
 
 Here is the input file:
 
Test of angle restraints using nmropt=1 with PBC                               
 &cntrl                                                                        
   nstlim=20,                                                                  
   ntpr=1, ntt=1,                                                              
   dt=0.001,                                                                   
   nmropt=1,                                                                   
   ig=71277,                                                                   
 /                                                                             
 &ewald                                                                        
  nfft1=64, nfft2=64, nfft3=64,netfrc=0,                                       
 /                                                                             
 &wt type='DUMPFREQ', istep1=2  /                                              
 &wt type='END'   /                                                            
DISANG=angle_pbc.RST                                                           
DUMPAVE=angle_pbc_vs_t                                                         
LISTIN=POUT                                                                    
LISTOUT=POUT                                                                   
/                                                                              
 
|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
|                     Version 12.1
| 
|                      08/17/2012
| 
| Implementation by:
|                    Ross C. Walker     (SDSC)
|                    Scott Le Grand     (nVIDIA)
|                    Duncan Poole       (nVIDIA)
| 
| CAUTION: The CUDA code is currently experimental.
|          You use it at your own risk. Be sure to
|          check ALL results carefully.
| 
| Precision model in use:
|      [DPDP] - All Double Precision.
| 
|--------------------------------------------------------
 
|----------------- CITATION INFORMATION -----------------
|
|    When publishing work that utilized the CUDA version
|    of AMBER, please cite the following in addition to
|    the regular AMBER citations:
|
|  - Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan
|    Poole; Scott L. Grand; Ross C. Walker "Routine
|    microsecond molecular dynamics simulations with
|    AMBER - Part II: Particle Mesh Ewald", J. Chem.
|    Theory Comput., 2012, (In Prep).
|
|  - Andreas W. Goetz; Mark J. Williamson; Dong Xu;
|    Duncan Poole; Scott L. Grand; Ross C. Walker
|    "Routine microsecond molecular dynamics simulations
|    with AMBER - Part I: Generalized Born", J. Chem.
|    Theory Comput., 2012, 8 (5), pp1542-1555.
|
|--------------------------------------------------------
 
|------------------- GPU DEVICE INFO --------------------
|
|                         Task ID:      0
|   CUDA Capable Devices Detected:      2
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla M2070
|     CUDA Device Global Mem Size:   5375 MB
| CUDA Device Num Multiprocessors:     14
|           CUDA Device Core Freq:   1.15 GHz
|
|--------------------------------------------------------
 
 
| Conditional Compilation Defines Used:
| DIRFRC_COMTRANS
| DIRFRC_EFS
| DIRFRC_NOVEC
| MPI
| PUBFFT
| FFTLOADBAL_2PROC
| BINTRAJ
| CUDA
 
| Largest sphere to fit in unit cell has radius =    26.433
| New format PARM file being parsed.
| Version =    1.000 Date = 10/29/10 Time = 19:03:17
| Note: 1-4 EEL scale factors were NOT found in the topology file.
|       Using default value of 1.2.
| Note: 1-4 VDW scale factors were NOT found in the topology file.
|       Using default value of 2.0.
| Duplicated    0 dihedrals
| Duplicated    0 dihedrals
--------------------------------------------------------------------------------
   1.  RESOURCE   USE: 
--------------------------------------------------------------------------------
 getting new box info from bottom of inpcrd
 NATOM  =   20921 NTYPES =      18 NBONH =   19659 MBONA  =    1297
 NTHETH =    2917 MTHETA =    1761 NPHIH =    5379 MPHIA  =    4347
 NHPARM =       0 NPARM  =       0 NNB   =   38593 NRES   =    6284
 NBONA  =    1297 NTHETA =    1761 NPHIA =    4347 NUMBND =      60
 NUMANG =     125 NPTRA  =      48 NATYP =      36 NPHB   =       1
 IFBOX  =       2 NMXRS  =      73 IFCAP =       0 NEXTRA =       0
 NCOPY  =       0
| Coordinate Index Table dimensions:    11   11   11
| Direct force subcell size =     5.8861    5.8861    5.8861
     BOX TYPE: TRUNCATED OCTAHEDRON
--------------------------------------------------------------------------------
   2.  CONTROL  DATA  FOR  THE  RUN
--------------------------------------------------------------------------------
                                                                                
General flags:
     imin    =       0, nmropt  =       1
Nature and format of input:
     ntx     =       1, irest   =       0, ntrx    =       1
Nature and format of output:
     ntxo    =       1, ntpr    =       1, ntrx    =       1, ntwr    =     500
     iwrap   =       0, ntwx    =       0, ntwv    =       0, ntwe    =       0
     ioutfm  =       0, ntwprt  =       0, idecomp =       0, rbornstat=      0
Potential function:
     ntf     =       1, ntb     =       1, igb     =       0, nsnb    =      25
     ipol    =       0, gbsa    =       0, iesp    =       0
     dielc   =   1.00000, cut     =   8.00000, intdiel =   1.00000
Frozen or restrained atoms:
     ibelly  =       0, ntr     =       0
Molecular dynamics:
     nstlim  =        20, nscm    =      1000, nrespa  =         1
     t       =   0.00000, dt      =   0.00100, vlimit  =  -1.00000
Berendsen (weak-coupling) temperature regulation:
     temp0   = 300.00000, tempi   =   0.00000, tautp   =   1.00000
NMR refinement options:
     iscale  =       0, noeskp  =       1, ipnlty  =       1, mxsub   =       1
     scalm   = 100.00000, pencut  =   0.10000, tausw   =   0.10000
| Intermolecular bonds treatment:
|     no_intermolecular_bonds =       1
| Energy averages sample interval:
|     ene_avg_sampling =       1
Ewald parameters:
     verbose =       0, ew_type =       0, nbflag  =       1, use_pme =       1
     vdwmeth =       1, eedmeth =       1, netfrc  =       0
     Box X =   64.747   Box Y =   64.747   Box Z =   64.747
     Alpha =  109.471   Beta  =  109.471   Gamma =  109.471
     NFFT1 =   64       NFFT2 =   64       NFFT3 =   64
     Cutoff=    8.000   Tol   =0.100E-04
     Ewald Coefficient =  0.34864
     Interpolation order =    4
| PMEMD ewald parallel performance parameters:
|     block_fft =    0
|     fft_blk_y_divisor =    2
|     excl_recip =    0
|     excl_master =    0
|     atm_redist_freq =  320
--------------------------------------------------------------------------------
   3.  ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------
                                                                                
 begin time read from input coords =  5908.800 ps
           Begin reading energy term weight changes/NMR restraints
 WEIGHT CHANGES:
 DUMPFREQ      2      0    0.000000    0.000000      0      0
                         ** No weight changes given **
 RESTRAINTS:
 Requested file redirections:
  DISANG    = angle_pbc.RST
  DUMPAVE   = angle_pbc_vs_t
  LISTIN    = POUT
  LISTOUT   = POUT
 Restraints will be read from file: angle_pbc.RST
Here are comments from the DISANG input file:
#  angle restraint for residue 34                                               
 
******
 HA  (  542)-HB3 (  545)-HG3 (  548)                NSTEP1=     0 NSTEP2=     0
R1 =  45.000 R2 =  90.000 R3 =  90.000 R4 = 115.000 RK2 =  10.000 RK3 =   15.000
 Rcurr:   75.791  Rcurr-(R2+R3)/2:   14.209  MIN(Rcurr-R2,Rcurr-R3):   14.209
                       Number of restraints read =     1
                  Done reading weight changes/NMR restraints
 
---------  mdout.angle 
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Nov 05 2012 - 22:30:03 PST