Re: [AMBER] pmemd cuda MPI nmropt crashes

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 05 Nov 2012 22:29:08 -0700

Hi Scott,

We know. Scott Le Grand and I are looking into it, although other pesky
things keep getting in the way. We hope to have a fix soon, along with
CUDA 5.0 support.

For now, just run on a single GPU if you are using NMROPT.
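
A minimal sketch of that workaround, assuming the job is launched from a
Python wrapper: it checks whether the mdin text sets nmropt to a non-zero
value and, if so, selects a serial GPU executable instead of the MPI one.
The function names and the default executable names (pmemd.cuda,
pmemd.cuda.MPI) are assumptions for illustration; substitute whatever your
build installed, for example the pmemd.cuda_DPDP variants quoted below.

```python
import re


def uses_nmropt(mdin_text: str) -> bool:
    """Return True if the mdin text sets nmropt to a non-zero value."""
    match = re.search(r"\bnmropt\s*=\s*(\d+)", mdin_text, flags=re.IGNORECASE)
    return match is not None and int(match.group(1)) != 0


def pick_executable(mdin_text: str,
                    serial_exe: str = "pmemd.cuda",
                    parallel_exe: str = "pmemd.cuda.MPI") -> str:
    """Choose the single-GPU build when NMR restraints are enabled.

    The executable names are placeholders (assumptions), not a statement
    about what any particular installation provides.
    """
    return serial_exe if uses_nmropt(mdin_text) else parallel_exe


if __name__ == "__main__":
    mdin = " &cntrl\n nstlim=20,\n nmropt=1,\n /\n"
    print(pick_executable(mdin))
```

The check is deliberately simple: it does not handle commented-out lines
or multiple &cntrl namelists, so treat it as a starting point only.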

All the best
Ross


On 11/5/12 11:08 PM, "Scott Brozell" <sbrozell.rci.rutgers.edu> wrote:

>Hi,
>
>All Amber12 floating point versions of pmemd.cuda_*.MPI are failing the
>nmropt tests. This happens consistently for 1 to 8 GPUs over 1 to 4 nodes
>on NVIDIA Tesla M2070 GPUs, where each node has 2 GPUs:
>https://www.osc.edu/supercomputing/hardware#Oakley
>The serial pmemd.cuda executables are passing these tests; in fact, all
>other tests generally look OK.
>
>I did not notice any other reports of similar failures.
>What should be done before a bug report is filed?
>
>thanks,
>scott
>
>
>--------- versions
>amber12/patch_amber.py --patch-level
>Latest patch applied to AmberTools12: 28
>mpif90 -show
>ifort -I/usr/local/mvapich2/1.7-intel/include
>-I/usr/local/mvapich2/1.7-intel/include
>-L/usr/local/mvapich2/1.7-intel/lib -lmpichf90 -lmpichf90 -lmpich -lopa
>-lmpl -lpthread -lhwloc -libverbs -libumad -ldl -lrt
>ifort -V
>Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on
>Intel(R) 64, Version 12.1.4.319 Build 20120410
>nvcc -V
>nvcc: NVIDIA (R) Cuda compiler driver
>Copyright (c) 2005-2012 NVIDIA Corporation
>Built on Thu_Apr__5_00:24:31_PDT_2012
>Cuda compilation tools, release 4.2, V0.2.1221
>--------- versions
>
>--------- traces
>I had to hack config.h to get a debug build.
>Why doesn't pmemd respect configure's -debug or AMBERBUILDFLAGS?
>
>Nov 05 1:25:35am 466$ /tmp/pbstmp.522586/test/cuda/nmropt/pme/angle
>mpiexec pmemd.cuda_DPDP.MPI.debug -O -c ../myoglobin_pbc.inpcrd -p
>../myoglobin_pbc.prmtop
>forrtl: severe (174): SIGSEGV, segmentation fault occurred
>Image              PC                Routine            Line     Source
>pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_     1575     gpu.cpp
>pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_  3415     nmr_calls.F90
>pmemd.cuda_DPDP.M  0000000000432EAB  parallel_mod_mp_p  329      parallel.F90
>pmemd.cuda_DPDP.M  000000000051C3CE  pme_alltasks_setu  174      pme_alltasks_setup.F90
>pmemd.cuda_DPDP.M  00000000004F92F7  MAIN__             204      pmemd.F90
>pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
>libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
>pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown
>
>Nov 05 3:11:40am 521$ /tmp/pbstmp.522586/test/cuda/nmropt/gb/angle
>mpiexec -n 1 pmemd.cuda_DPDP.MPI.debug -O -c -O -c ../myoglobin_gb.inpcrd
>-p ../myoglobin_gb.prmtop
>forrtl: severe (174): SIGSEGV, segmentation fault occurred
>Image              PC                Routine            Line     Source
>pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_     1575     gpu.cpp
>pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_  3415     nmr_calls.F90
>pmemd.cuda_DPDP.M  0000000000520272  gb_alltasks_setup  116      gb_alltasks_setup.F90
>pmemd.cuda_DPDP.M  00000000004F84BA  MAIN__             206      pmemd.F90
>pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
>libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
>pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown
>mpiexec: Warning: task 0 exited with status 174.
>
>Nov 05 3:20:02am 539$ /tmp/pbstmp.522586/test/cuda/nmropt/pme/temp
>mpiexec -n 1 pmemd.cuda_DPDP.MPI.debug -O -p ../myoglobin_pbc.prmtop -c
>../myoglobin_pbc.inpcrd -i mdin
>forrtl: severe (174): SIGSEGV, segmentation fault occurred
>Image              PC                Routine            Line     Source
>pmemd.cuda_DPDP.M  000000000054B607  gpu_nmr_setup_     1575     gpu.cpp
>pmemd.cuda_DPDP.M  000000000050B916  nmr_calls_mod_mp_  3415     nmr_calls.F90
>pmemd.cuda_DPDP.M  0000000000432EAB  parallel_mod_mp_p  329      parallel.F90
>pmemd.cuda_DPDP.M  000000000051C3CE  pme_alltasks_setu  174      pme_alltasks_setup.F90
>pmemd.cuda_DPDP.M  00000000004F92F7  MAIN__             204      pmemd.F90
>pmemd.cuda_DPDP.M  0000000000404FEC  Unknown            Unknown  Unknown
>libc.so.6          000000384421ECDD  Unknown            Unknown  Unknown
>pmemd.cuda_DPDP.M  0000000000404EE9  Unknown            Unknown  Unknown
>--------- traces
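
The traces above can be compared mechanically. The following is a small
sketch, not part of Amber, that strips the quote markers, rejoins any
wrapped ifort traceback lines, and reports the leading frames that all
traces share. The function names and the regular expression are
assumptions based only on the traceback layout shown in this message.

```python
import re

# One frame: image, 16-digit hex PC, routine, line number (or "Unknown"), source.
FRAME = re.compile(
    r"(\S+)\s+([0-9A-Fa-f]{16})\s+(\S+)\s+(\d+|Unknown)\s+(\S+)"
)


def parse_frames(trace_text):
    """Return (routine, source, line) tuples in call-stack order."""
    # Drop leading '>' quote markers and flatten wrapped lines into one string.
    flat = " ".join(line.lstrip("> ").strip() for line in trace_text.splitlines())
    return [
        (routine, source, line)
        for _image, _pc, routine, line, source in FRAME.findall(flat)
    ]


def shared_top_frames(traces):
    """Return the leading frames that are identical in every trace."""
    shared = []
    for frames in zip(*(parse_frames(t) for t in traces)):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


if __name__ == "__main__":
    pme = """
>pmemd.cuda_DPDP.M 000000000054B607 gpu_nmr_setup_ 1575
>gpu.cpp
>pmemd.cuda_DPDP.M 000000000050B916 nmr_calls_mod_mp_ 3415
>nmr_calls.F90
>pmemd.cuda_DPDP.M 0000000000432EAB parallel_mod_mp_p 329
>parallel.F90
"""
    gb = """
>pmemd.cuda_DPDP.M 000000000054B607 gpu_nmr_setup_ 1575
>gpu.cpp
>pmemd.cuda_DPDP.M 000000000050B916 nmr_calls_mod_mp_ 3415
>nmr_calls.F90
>pmemd.cuda_DPDP.M 0000000000520272 gb_alltasks_setup 116
>gb_alltasks_setup.F90
"""
    for frame in shared_top_frames([pme, gb]):
        print(frame)
```

For the PME and GB traces in this message, the shared frames are
gpu_nmr_setup_ (gpu.cpp, line 1575) and nmr_calls_mod_mp_ (nmr_calls.F90,
line 3415); the callers differ after that.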
>
>--------- typical test results
>testcuda.#nodes.ppn.#gpus.id
>
>testcuda.1.1.1.o522657
>26 file comparisons passed
>10 file comparisons failed
>20 tests experienced errors
>--
>29 file comparisons passed
>7 file comparisons failed
>20 tests experienced errors
>--
>31 file comparisons passed
>5 file comparisons failed
>20 tests experienced errors
>
>testcuda.1.2.2.o522658
>31 file comparisons passed
>5 file comparisons failed
>20 tests experienced errors
>--
>29 file comparisons passed
>7 file comparisons failed
>20 tests experienced errors
>--
>27 file comparisons passed
>0 file comparisons failed
>56 tests experienced errors
>
>testcuda.2.2.2.o506511
>31 file comparisons passed
>5 file comparisons failed
>20 tests experienced errors
>--
>29 file comparisons passed
>7 file comparisons failed
>20 tests experienced errors
>--
>35 file comparisons passed
>1 file comparison failed
>20 tests experienced errors
>
>testcuda.3.2.2.o508216
>30 file comparisons passed
>6 file comparisons failed
>20 tests experienced errors
>--
>30 file comparisons passed
>6 file comparisons failed
>20 tests experienced errors
>--
>35 file comparisons passed
>1 file comparison failed
>20 tests experienced errors
>--------- typical test results
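
To compare runs, the summaries above can be tallied with a short script.
This is a sketch under the assumption that each job header follows the
testcuda.#nodes.ppn.#gpus.id pattern and that every group of counts ends
with the "tests experienced errors" line. The function name and the shape
of the return value are illustrative, not part of the Amber test suite.

```python
import re

# Job header such as "testcuda.1.2.2.o522658" (nodes, ppn, gpus, id).
HEADER = re.compile(r"testcuda\.(\d+)\.(\d+)\.(\d+)\.(\S+)")
# Count lines such as "31 file comparisons passed" or "20 tests experienced errors".
COUNT = re.compile(
    r"(\d+) (?:file comparisons? (passed|failed)|tests experienced (errors))"
)


def summarize(text):
    """Map each job header to a list of (passed, failed, errors) tuples."""
    results = {}
    job, current = None, {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        if HEADER.fullmatch(line):
            job, current = line, {}
            results[job] = []
            continue
        match = COUNT.fullmatch(line)
        if match is None or job is None:
            continue  # separators ("--"), legend lines, and blanks
        current[match.group(2) or match.group(3)] = int(match.group(1))
        if match.group(3):  # the errors line closes one group of counts
            results[job].append(
                (current.get("passed", 0), current.get("failed", 0), current["errors"])
            )
            current = {}
    return results


if __name__ == "__main__":
    sample = """
>testcuda.1.2.2.o522658
>31 file comparisons passed
>5 file comparisons failed
>20 tests experienced errors
>--
>27 file comparisons passed
>0 file comparisons failed
>56 tests experienced errors
"""
    for job, groups in summarize(sample).items():
        print(job, groups)
```

The groups are kept in the order they appear; the message does not say
which build each group belongs to, so the script does not label them.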
>
>--------- mdout.angle
>
> -------------------------------------------------------
> Amber 12 SANDER 2012
> -------------------------------------------------------
>
>| PMEMD implementation of SANDER, Release 12
>
>| Run on 11/05/2012 at 03:06:38
>
> [-O]verwriting output
>
>File Assignments:
>| MDIN: mdin
>| MDOUT: mdout
>| INPCRD: ../myoglobin_pbc.inpcrd
>| PARM: ../myoglobin_pbc.prmtop
>| RESTRT: restrt
>| REFC: refc
>| MDVEL: mdvel
>| MDEN: mden
>| MDCRD: mdcrd
>| MDINFO: mdinfo
>|LOGFILE: logfile
>
> Here is the input file:
>
>Test of angle restraints using nmropt=1 with PBC
> &cntrl
> nstlim=20,
> ntpr=1, ntt=1,
> dt=0.001,
> nmropt=1,
> ig=71277,
> /
> &ewald
> nfft1=64, nfft2=64, nfft3=64,netfrc=0,
> /
> &wt type='DUMPFREQ', istep1=2 /
> &wt type='END' /
>DISANG=angle_pbc.RST
>DUMPAVE=angle_pbc_vs_t
>LISTIN=POUT
>LISTOUT=POUT
>/
>
>
>
>
>|--------------------- INFORMATION ----------------------
>| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
>| Version 12.1
>|
>| 08/17/2012
>|
>| Implementation by:
>| Ross C. Walker (SDSC)
>| Scott Le Grand (nVIDIA)
>| Duncan Poole (nVIDIA)
>|
>| CAUTION: The CUDA code is currently experimental.
>| You use it at your own risk. Be sure to
>| check ALL results carefully.
>|
>| Precision model in use:
>| [DPDP] - All Double Precision.
>|
>|--------------------------------------------------------
>
>|----------------- CITATION INFORMATION -----------------
>|
>| When publishing work that utilized the CUDA version
>| of AMBER, please cite the following in addition to
>| the regular AMBER citations:
>|
>| - Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan
>| Poole; Scott L. Grand; Ross C. Walker "Routine
>| microsecond molecular dynamics simulations with
>| AMBER - Part II: Particle Mesh Ewald", J. Chem.
>| Theory Comput., 2012, (In Prep).
>|
>| - Andreas W. Goetz; Mark J. Williamson; Dong Xu;
>| Duncan Poole; Scott L. Grand; Ross C. Walker
>| "Routine microsecond molecular dynamics simulations
>| with AMBER - Part I: Generalized Born", J. Chem.
>| Theory Comput., 2012, 8 (5), pp1542-1555.
>|
>|--------------------------------------------------------
>
>|------------------- GPU DEVICE INFO --------------------
>|
>| Task ID: 0
>| CUDA Capable Devices Detected: 2
>| CUDA Device ID in use: 0
>| CUDA Device Name: Tesla M2070
>| CUDA Device Global Mem Size: 5375 MB
>| CUDA Device Num Multiprocessors: 14
>| CUDA Device Core Freq: 1.15 GHz
>|
>|--------------------------------------------------------
>
>
>| Conditional Compilation Defines Used:
>| DIRFRC_COMTRANS
>| DIRFRC_EFS
>| DIRFRC_NOVEC
>| MPI
>| PUBFFT
>| FFTLOADBAL_2PROC
>| BINTRAJ
>| CUDA
>
>| Largest sphere to fit in unit cell has radius = 26.433
>
>| New format PARM file being parsed.
>| Version = 1.000 Date = 10/29/10 Time = 19:03:17
>
>| Note: 1-4 EEL scale factors were NOT found in the topology file.
>| Using default value of 1.2.
>
>| Note: 1-4 VDW scale factors were NOT found in the topology file.
>| Using default value of 2.0.
>| Duplicated 0 dihedrals
>
>| Duplicated 0 dihedrals
>
>--------------------------------------------------------------------------------
> 1. RESOURCE USE:
>--------------------------------------------------------------------------------
>
> getting new box info from bottom of inpcrd
>
> NATOM = 20921 NTYPES = 18 NBONH = 19659 MBONA = 1297
> NTHETH = 2917 MTHETA = 1761 NPHIH = 5379 MPHIA = 4347
> NHPARM = 0 NPARM = 0 NNB = 38593 NRES = 6284
> NBONA = 1297 NTHETA = 1761 NPHIA = 4347 NUMBND = 60
> NUMANG = 125 NPTRA = 48 NATYP = 36 NPHB = 1
> IFBOX = 2 NMXRS = 73 IFCAP = 0 NEXTRA = 0
> NCOPY = 0
>
>| Coordinate Index Table dimensions: 11 11 11
>| Direct force subcell size = 5.8861 5.8861 5.8861
>
> BOX TYPE: TRUNCATED OCTAHEDRON
>
>--------------------------------------------------------------------------------
> 2. CONTROL DATA FOR THE RUN
>--------------------------------------------------------------------------------
>
>
>
>
>General flags:
> imin = 0, nmropt = 1
>
>Nature and format of input:
> ntx = 1, irest = 0, ntrx = 1
>
>Nature and format of output:
> ntxo = 1, ntpr = 1, ntrx = 1, ntwr = 500
> iwrap = 0, ntwx = 0, ntwv = 0, ntwe = 0
> ioutfm = 0, ntwprt = 0, idecomp = 0, rbornstat= 0
>
>Potential function:
> ntf = 1, ntb = 1, igb = 0, nsnb = 25
> ipol = 0, gbsa = 0, iesp = 0
> dielc = 1.00000, cut = 8.00000, intdiel = 1.00000
>
>Frozen or restrained atoms:
> ibelly = 0, ntr = 0
>
>Molecular dynamics:
> nstlim = 20, nscm = 1000, nrespa = 1
> t = 0.00000, dt = 0.00100, vlimit = -1.00000
>
>Berendsen (weak-coupling) temperature regulation:
> temp0 = 300.00000, tempi = 0.00000, tautp = 1.00000
>
>NMR refinement options:
> iscale = 0, noeskp = 1, ipnlty = 1, mxsub = 1
> scalm = 100.00000, pencut = 0.10000, tausw = 0.10000
>
>| Intermolecular bonds treatment:
>| no_intermolecular_bonds = 1
>
>| Energy averages sample interval:
>| ene_avg_sampling = 1
>
>Ewald parameters:
> verbose = 0, ew_type = 0, nbflag = 1, use_pme = 1
> vdwmeth = 1, eedmeth = 1, netfrc = 0
> Box X = 64.747 Box Y = 64.747 Box Z = 64.747
> Alpha = 109.471 Beta = 109.471 Gamma = 109.471
> NFFT1 = 64 NFFT2 = 64 NFFT3 = 64
> Cutoff= 8.000 Tol =0.100E-04
> Ewald Coefficient = 0.34864
> Interpolation order = 4
>
>| PMEMD ewald parallel performance parameters:
>| block_fft = 0
>| fft_blk_y_divisor = 2
>| excl_recip = 0
>| excl_master = 0
>| atm_redist_freq = 320
>
>--------------------------------------------------------------------------------
> 3. ATOMIC COORDINATES AND VELOCITIES
>--------------------------------------------------------------------------------
>
>
>
> begin time read from input coords = 5908.800 ps
>
>
>
> Begin reading energy term weight changes/NMR restraints
> WEIGHT CHANGES:
> DUMPFREQ 2 0 0.000000 0.000000 0 0
> ** No weight changes given **
>
> RESTRAINTS:
> Requested file redirections:
> DISANG = angle_pbc.RST
> DUMPAVE = angle_pbc_vs_t
> LISTIN = POUT
> LISTOUT = POUT
> Restraints will be read from file: angle_pbc.RST
>Here are comments from the DISANG input file:
># angle restraint for residue 34
>
>
>******
> HA ( 542)-HB3 ( 545)-HG3 ( 548) NSTEP1= 0 NSTEP2= 0
>R1 = 45.000 R2 = 90.000 R3 = 90.000 R4 = 115.000 RK2 = 10.000 RK3 = 15.000
> Rcurr: 75.791 Rcurr-(R2+R3)/2: 14.209 MIN(Rcurr-R2,Rcurr-R3): 14.209
> Number of restraints read = 1
>
> Done reading weight changes/NMR restraints
>
>
>
>--------- mdout.angle
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Mon Nov 05 2012 - 22:30:05 PST