A related question -- which MPI do the Amber developers recommend for use with Amber 12 and GPUs, in terms of performance and reliability?
thanks,
Sharon Shaw
-----Original Message-----
From: Robert Crovella [mailto:RCrovella.nvidia.com]
Sent: Tuesday, May 01, 2012 2:15 PM
To: 'Vijay Manickam Achari'; Amber mailing List; Jason Swails
Subject: Re: [AMBER] using two GPUs
Your troubleshooting makes it sound MPI-related. I have not used MPICH2 extensively. Could you try another MPI, such as OpenMPI or MVAPICH2?
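For what it's worth, switching MPI stacks means rebuilding the parallel GPU binaries against the new MPI. A sketch of that procedure, assuming the usual Amber 12 configure/make conventions and that the new MPI's compiler wrappers are first in PATH (the output name below is a placeholder):

  # confirm which MPI wrappers will be picked up
  which mpicc mpif90
  cd $AMBERHOME
  ./configure -cuda -mpi gnu   # regenerate the build configuration against the new MPI
  make install                 # rebuilds pmemd.cuda.MPI into $AMBERHOME/bin
  # then retry the two-GPU run:
  mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i MD-betaMalto-THERMO.in -o md.out -p malto-THERMO.top -c betaMalto-THERMO-MD03-run0100.rst.1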
-----Original Message-----
From: Vijay Manickam Achari [mailto:vjrajamany.yahoo.com]
Sent: Wednesday, April 25, 2012 7:57 PM
To: Jason Swails; Amber mailing List
Subject: Re: [AMBER] using two GPUs
This time I tried to run only 500 and 1500 steps for testing purposes. First I used nstlim=500, dt=0.001, ntwe=10, ntwx=10, ntpr=10, ntwr=-50, and the run finished with no problem.
Second I used nstlim=1500, dt=0.001, ntwe=10, ntwx=10, ntpr=10, ntwr=-50, and this time the simulation hung at step 900.
Third I used nstlim=1500, dt=0.001, ntwe=1, ntwx=1, ntpr=1, ntwr=-50 to see how the energy changes. The simulation hung at step 1016.
NSTEP = 1016 TIME(PS) = 1.016 TEMP(K) = 231.20 PRESS = 212.0
Etot = 51376.0779 EKtot = 11584.5283 EPtot = 39791.5496
BOND = 2538.6455 ANGLE = 10704.0520 DIHED = 3471.0102
1-4 NB = 3675.9256 1-4 EEL = 96184.8345 VDWAALS = -13063.9008
EELEC = -63719.0174 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 260.1454 VIRIAL = -576.2519 VOLUME = 182698.3729
Density = 1.1881
BUT I observed no drastic change in the energy values. The total energy was around ~43000 up to step 1000 and increased to around ~51000 from step 1001 (right after the new random velocities were set). The results are below.
NSTEP = 999 TIME(PS) = 0.999 TEMP(K) = 151.99 PRESS = -681.7
Etot = 43369.7290 EKtot = 7615.5493 EPtot = 35754.1797
BOND = 1943.1439 ANGLE = 7986.1552 DIHED = 3365.1915
1-4 NB = 3529.8495 1-4 EEL = 96172.8503 VDWAALS = -13265.6542
EELEC = -63977.3564 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 120.0312 VIRIAL = 2809.6281 VOLUME = 182721.1092
Density = 1.1879
------------------------------------------------------------------------------
Setting new random velocities at step 1000
writing malto-THERMO-RT-MD01-run0100.rst_1000
NSTEP = 1000 TIME(PS) = 1.000 TEMP(K) = 152.03 PRESS = -691.1
Etot = 43369.7290 EKtot = 7617.7074 EPtot = 35752.0217
BOND = 1942.4089 ANGLE = 7980.2503 DIHED = 3364.7378
1-4 NB = 3530.5246 1-4 EEL = 96175.4581 VDWAALS = -13264.7679
EELEC = -63976.5901 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 119.8069 VIRIAL = 2846.4688 VOLUME = 182718.3273
Density = 1.1880
------------------------------------------------------------------------------
NSTEP = 1001 TIME(PS) = 1.001 TEMP(K) = 308.35 PRESS = -667.8
Etot = 51461.3924 EKtot = 15450.2119 EPtot = 36011.1804
BOND = 2004.2228 ANGLE = 8159.1964 DIHED = 3368.4687
1-4 NB = 3533.7888 1-4 EEL = 96175.8826 VDWAALS = -13261.5231
EELEC = -63968.8558 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 248.3719 VIRIAL = 2882.7008 VOLUME = 182715.5070
Density = 1.1880
------------------------------------------------------------------------------
To my surprise, when I ran the same system for a 1 ns time scale on a single GPU (using the pmemd.cuda command), I did not have any problem. The output is below.
NSTEP = 500 TIME(PS) = 1.000 TEMP(K) = 150.36 PRESS = -688.0
Etot = 43593.5711 EKtot = 7534.1079 EPtot = 36059.4632
BOND = 2020.8323 ANGLE = 8176.2111 DIHED = 3368.1565
1-4 NB = 3538.7955 1-4 EEL = 96200.9385 VDWAALS = -13261.4397
EELEC = -63984.0310 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 120.2504 VIRIAL = 2834.7883 VOLUME = 182734.6073
Density = 1.1879
------------------------------------------------------------------------------
Setting new random velocities at step 1000
NSTEP = 1000 TIME(PS) = 2.000 TEMP(K) = 153.13 PRESS = -42.8
Etot = 43647.0169 EKtot = 7672.8738 EPtot = 35974.1431
BOND = 2044.1647 ANGLE = 8204.0971 DIHED = 3398.7548
1-4 NB = 3551.2077 1-4 EEL = 96137.6463 VDWAALS = -13276.8357
EELEC = -64084.8917 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 131.0958 VIRIAL = 298.6493 VOLUME = 181465.9198
Density = 1.1962
------------------------------------------------------------------------------
NSTEP = 1500 TIME(PS) = 3.000 TEMP(K) = 225.60 PRESS = 202.3
Etot = 51174.8405 EKtot = 11304.1250 EPtot = 39870.7155
BOND = 2685.0862 ANGLE = 10388.1851 DIHED = 3563.3844
1-4 NB = 3721.5216 1-4 EEL = 96202.9891 VDWAALS = -12978.1192
EELEC = -63712.3318 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 171.0075 VIRIAL = -630.2882 VOLUME = 183425.5850
Density = 1.1834
------------------------------------------------------------------------------
I have even run the same job on each of the two GPUs separately, one after another, using the pmemd.cuda command, and found no problem at all.
Only when I use pmemd.cuda.MPI does the simulation run for a while and then hang (at step 1016).
Are there any suggestions for solving this problem?
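For reference, that side-by-side comparison can be scripted roughly as follows (a sketch; the output and restart file names are placeholders, and per-device selection via CUDA_VISIBLE_DEVICES is assumed):

  # single-GPU control runs, one device at a time
  export CUDA_VISIBLE_DEVICES=0
  pmemd.cuda -O -i MD-betaMalto-THERMO.in -p malto-THERMO.top \
    -c betaMalto-THERMO-MD03-run0100.rst.1 -o gpu0.out -r gpu0.rst
  export CUDA_VISIBLE_DEVICES=1
  pmemd.cuda -O -i MD-betaMalto-THERMO.in -p malto-THERMO.top \
    -c betaMalto-THERMO-MD03-run0100.rst.1 -o gpu1.out -r gpu1.rst

  # two-GPU parallel run (the one that hangs)
  export CUDA_VISIBLE_DEVICES=0,1
  mpirun -np 2 pmemd.cuda.MPI -O -i MD-betaMalto-THERMO.in -p malto-THERMO.top \
    -c betaMalto-THERMO-MD03-run0100.rst.1 -o gpu0and1.out -r gpu0and1.rst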
Regards
Vijay Manickam Achari
(Phd Student c/o Prof Rauzah Hashim)
Chemistry Department,
University of Malaya,
Malaysia
vjramana.gmail.com
________________________________
From: Jason Swails <jason.swails.gmail.com>
To: Vijay Manickam Achari <vjrajamany.yahoo.com>; AMBER Mailing List <amber.ambermd.org>
Sent: Thursday, 26 April 2012, 2:44
Subject: Re: [AMBER] using two GPUs
How long were you trying to run? My suggestion is to run shorter simulations, printing every step (for starters), and see if you can narrow down the problem. In my experience, an infinite hang is impossible for even the most knowledgeable people to debug without a reproducible case.
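For concreteness, a minimal debugging input along those lines might look like the following (a sketch assembled from the &cntrl namelist quoted further down in this thread, shortened and with every quantity written at each step):

  Short debug run: print energies and coordinates every step
   &cntrl
    imin=0, irest=1, ntx=5,
    ntxo=1, iwrap=1, nscm=2000,
    ntt=2, tempi=300.0, temp0=300.0, tautp=2.0,
    ntp=2, ntb=2, taup=2.0,
    ntc=2, ntf=2,
    nstlim=1500, dt=0.001,
    ntwe=1, ntwx=1, ntpr=1, ntwr=-50,
    ntr=0,
    cut=9.0,
   /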
HTH,
Jason
On Wed, Apr 25, 2012 at 2:41 PM, Vijay Manickam Achari <vjrajamany.yahoo.com> wrote:
Dear Jason,
>
>Thank you so much for your reply.
>
>This time I tried your suggestion and it worked, BUT the run just hung after a few steps (100). Here is the output file that I got.
>
>***********************************************************************************
>
>
> -------------------------------------------------------
> Amber 12 SANDER 2012
> -------------------------------------------------------
>
>| PMEMD implementation of SANDER, Release 12
>
>| Run on 04/26/2012 at 02:35:03
>
> [-O]verwriting output
>
>File Assignments:
>| MDIN: MD-betaMalto-THERMO.in
>| MDOUT: malto-THERMO-RT-MD00-run1000.out
>| INPCRD: betaMalto-THERMO-MD03-run0100.rst.1
>| PARM: malto-THERMO.top
>| RESTRT: malto-THERMO-RT-MD01-run0100.rst
>| REFC: refc
>| MDVEL: mdvel
>| MDEN: mden
>| MDCRD: malto-THERMO-RT-MD00-run1000.traj
>| MDINFO: mdinfo
>|LOGFILE: logfile
>
>
> Here is the input file:
>
>Dynamic Simulation with Constant Pressure
> &cntrl
> imin=0,
> irest=1, ntx=5,
> ntxo=1, iwrap=1, nscm=2000,
> ntt=2,
> tempi = 300.0, temp0=300.0, tautp=2.0,
> ntp=2, ntb=2, taup=2.0,
> ntc=2, ntf=2,
> nstlim=100000, dt=0.001,
> ntwe=100, ntwx=100, ntpr=100, ntwr=-50000,
> ntr=0,
> cut = 9
> /
>
>
>
>
>|--------------------- INFORMATION ----------------------
>| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
>| Version 12.0
>|
>| 03/19/2012
>|
>| Implementation by:
>| Ross C. Walker (SDSC)
>| Scott Le Grand (nVIDIA)
>| Duncan Poole (nVIDIA)
>|
>| CAUTION: The CUDA code is currently experimental.
>| You use it at your own risk. Be sure to
>| check ALL results carefully.
>|
>| Precision model in use:
>| [SPDP] - Hybrid Single/Double Precision (Default).
>|
>|--------------------------------------------------------
>
>|------------------- GPU DEVICE INFO --------------------
>|
>| Task ID: 0
>| CUDA Capable Devices Detected: 2
>| CUDA Device ID in use: 0
>| CUDA Device Name: Tesla C2075
>| CUDA Device Global Mem Size: 6143 MB
>| CUDA Device Num Multiprocessors: 14
>| CUDA Device Core Freq: 1.15 GHz
>|
>|
>| Task ID: 1
>| CUDA Capable Devices Detected: 2
>| CUDA Device ID in use: 1
>| CUDA Device Name: Tesla C2075
>| CUDA Device Global Mem Size: 6143 MB
>| CUDA Device Num Multiprocessors: 14
>| CUDA Device Core Freq: 1.15 GHz
>|
>|--------------------------------------------------------
>
>
>| Conditional Compilation Defines Used:
>| DIRFRC_COMTRANS
>| DIRFRC_EFS
>| DIRFRC_NOVEC
>| MPI
>| PUBFFT
>| FFTLOADBAL_2PROC
>| BINTRAJ
>| CUDA
>
>| Largest sphere to fit in unit cell has radius = 23.378
>
>| New format PARM file being parsed.
>| Version = 1.000 Date = 07/07/08 Time = 10:50:18
>
>| Note: 1-4 EEL scale factors were NOT found in the topology file.
>| Using default value of 1.2.
>
>| Note: 1-4 VDW scale factors were NOT found in the topology file.
>| Using default value of 2.0.
>| Duplicated 0 dihedrals
>
>| Duplicated 0 dihedrals
>
>--------------------------------------------------------------------------------
> 1. RESOURCE USE:
>--------------------------------------------------------------------------------
>
> getting new box info from bottom of inpcrd
>
> NATOM = 20736 NTYPES = 7 NBONH = 11776 MBONA = 9216
> NTHETH = 27648 MTHETA = 12032 NPHIH = 45312 MPHIA = 21248
> NHPARM = 0 NPARM = 0 NNB = 119552 NRES = 256
> NBONA = 9216 NTHETA = 12032 NPHIA = 21248 NUMBND = 7
> NUMANG = 14 NPTRA = 20 NATYP = 7 NPHB = 0
> IFBOX = 1 NMXRS = 81 IFCAP = 0 NEXTRA = 0
> NCOPY = 0
>
>| Coordinate Index Table dimensions: 13 8 9
>| Direct force subcell size = 5.9091 5.8446 5.8286
>
> BOX TYPE: RECTILINEAR
>
>--------------------------------------------------------------------------------
> 2. CONTROL DATA FOR THE RUN
>--------------------------------------------------------------------------------
>
>
>
>General flags:
> imin = 0, nmropt = 0
>
>Nature and format of input:
> ntx = 5, irest = 1, ntrx = 1
>
>Nature and format of output:
> ntxo = 1, ntpr = 100, ntrx = 1, ntwr = -50000
> iwrap = 1, ntwx = 100, ntwv = 0, ntwe = 100
> ioutfm = 0, ntwprt = 0, idecomp = 0, rbornstat= 0
>
>Potential function:
> ntf = 2, ntb = 2, igb = 0, nsnb = 25
> ipol = 0, gbsa = 0, iesp = 0
> dielc = 1.00000, cut = 9.00000, intdiel = 1.00000
>
>Frozen or restrained atoms:
> ibelly = 0, ntr = 0
>
>Molecular dynamics:
> nstlim = 100000, nscm = 2000, nrespa = 1
> t = 0.00000, dt = 0.00100, vlimit = -1.00000
>
>Anderson (strong collision) temperature regulation:
> ig = 71277, vrand = 1000
> temp0 = 300.00000, tempi = 300.00000
>
>Pressure regulation:
> ntp = 2
> pres0 = 1.00000, comp = 44.60000, taup = 2.00000
>
>SHAKE:
> ntc = 2, jfastw = 0
> tol = 0.00001
>
>| Intermolecular bonds treatment:
>| no_intermolecular_bonds = 1
>
>| Energy averages sample interval:
>| ene_avg_sampling = 100
>
>Ewald parameters:
> verbose = 0, ew_type = 0, nbflag = 1, use_pme = 1
> vdwmeth = 1, eedmeth = 1, netfrc = 1
> Box X = 76.818 Box Y = 46.757 Box Z = 52.457
> Alpha = 90.000 Beta = 90.000 Gamma = 90.000
> NFFT1 = 80 NFFT2 = 48 NFFT3 = 56
> Cutoff= 9.000 Tol =0.100E-04
> Ewald Coefficient = 0.30768
> Interpolation order = 4
>
>| PMEMD ewald parallel performance parameters:
>| block_fft = 0
>| fft_blk_y_divisor = 2
>| excl_recip = 0
>| excl_master = 0
>| atm_redist_freq = 320
>
>--------------------------------------------------------------------------------
> 3. ATOMIC COORDINATES AND VELOCITIES
>--------------------------------------------------------------------------------
>
>trajectory generated by ptraj
> begin time read from input coords = 0.000 ps
>
>
> Number of triangulated 3-point waters found: 0
>
> Sum of charges from parm topology file = 0.00000000
> Forcing neutrality...
>
>| Dynamic Memory, Types Used:
>| Reals 1291450
>| Integers 3021916
>
>| Nonbonded Pairs Initial Allocation: 3136060
>
>| GPU memory information:
>| KB of GPU memory in use: 146374
>| KB of CPU memory in use: 30984
>
>| Running AMBER/MPI version on 2 nodes
>
>
>--------------------------------------------------------------------------------
> 4. RESULTS
>--------------------------------------------------------------------------------
>
> ---------------------------------------------------
> APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
> using 5000.0 points per unit in tabled values
> TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
>| CHECK switch(x): max rel err = 0.2738E-14 at 2.422500
>| CHECK d/dx switch(x): max rel err = 0.8314E-11 at 2.736960
> ---------------------------------------------------
>|---------------------------------------------------
>| APPROXIMATING direct energy using CUBIC SPLINE INTERPOLATION
>| with 50.0 points per unit in tabled values
>| Relative Error Limit not exceeded for r .gt. 2.39
>| APPROXIMATING direct force using CUBIC SPLINE INTERPOLATION
>| with 50.0 points per unit in tabled values
>| Relative Error Limit not exceeded for r .gt. 2.84
>|---------------------------------------------------
>
> NSTEP = 100 TIME(PS) = 0.100 TEMP(K) = 147.84 PRESS = -2761.2
> Etot = 43456.9441 EKtot = 7407.5049 EPtot = 36049.4392
> BOND = 2118.3308 ANGLE = 7884.5560 DIHED = 3309.5527
> 1-4 NB = 3447.1834 1-4 EEL = 95835.1134 VDWAALS = -13101.7433
> EELEC = -63443.5538 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 110.3519 VIRIAL = 11263.9500 VOLUME = 187084.9335
> Density = 1.1602
> ------------------------------------------------------------------------------
>
>***************************************************************************************
>
>The run just hangs and there is no progress at all.
>Does the command need any other input?
>
>Regards
>
>
>Vijay Manickam Achari
>(Phd Student c/o Prof Rauzah Hashim)
>Chemistry Department,
>University of Malaya,
>Malaysia
>vjramana.gmail.com
>
>
>________________________________
> From: Jason Swails <jason.swails.gmail.com>
>To: AMBER Mailing List <amber.ambermd.org>
>Sent: Thursday, 26 April 2012, 0:05
>Subject: Re: [AMBER] using two GPUs
>
>Hello,
>
>On Wed, Apr 25, 2012 at 1:34 AM, Vijay Manickam Achari <vjrajamany.yahoo.com
>> wrote:
>
>> Thank you for the kind reply.
>> I have tried to figure out based on your info and other sources as well to
>> get the two GPUs work.
>>
>> For the machinefile:
>> I checked the /dev folder and saw a list of NVIDIA device names: nvidia0,
>> nvidia1, nvidia2, nvidia3, nvidia4. I understood that these names should be
>> listed in the machinefile, so I commented out nvidia0, nvidia1, and nvidia2 since I
>> only wanted to use two GPUs.
>>
>
>The names in the hostfile (or machinefile) are host names (you can get
>yours via "hostname"). However, machinefiles are really only necessary if
>you plan on going off-node. What a machinefile tells the MPI is *where* on
>the network each thread should be launched.
>
>If you want to run everything locally on the same machine, every MPI
>implementation that I've ever used allows you to just say:
>
>mpirun -np 2 pmemd.cuda.MPI -O -i mdin ...etc.
>
>If you need to use the hostfile or machinefile, look at the mpirun manpage
>to see how your particular MPI reads them.
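>
>As an illustration only (a sketch; the exact hostfile syntax and flag differ
>between MPI implementations, and "hosts" is just a placeholder file name):
>
>  # purely local run -- no hostfile needed
>  mpirun -np 2 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
>
>  # contents of the hostfile "hosts" (OpenMPI-style syntax; MPICH2's Hydra
>  # launcher takes "mpiexec -f <file>" instead). It lists host names, not
>  # /dev device names:
>  #   localhost slots=2
>  mpirun -np 2 --hostfile hosts pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd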
>
>HTH,
>Jason
>
>--
>Jason M. Swails
>Quantum Theory Project,
>University of Florida
>Ph.D. Candidate
>352-392-4032
>
--
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Candidate
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue May 01 2012 - 12:30:03 PDT