Sounds like the Titan is freaking out actually...
Sigh...
On Thu, Jun 20, 2013 at 11:42 AM, ET <sketchfoot.gmail.com> wrote:
> Hi Ross,
>
> What you say makes sense. The error does occur at the start of the
> production segment. However, IMO it is GPU related: if you look at the
> previous segment, which generated the restart file for the segment that
> failed, there are ******'s written in its outfile.
>
> Restarting the run that shows the nfft error inevitably fails, but
> restarting from the run before it works fine in all the cases I have
> tried so far. So the error that led to NaN in the coordinate file only
> happens intermittently.
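>
> (For anyone else hitting this: a minimal sketch, in Python, of the
> check I now run on a segment's ASCII restart file before reusing it.
> The filename handling is just an example, not part of AMBER.)
>
> #!/usr/bin/env python
> # check_rst.py - flag Fortran overflow stars ("****") or NaNs in an
> # ASCII restart file before it seeds the next segment.
> import sys
>
> def restart_is_clean(path):
>     with open(path) as fh:
>         for lineno, line in enumerate(fh, 1):
>             if "*" in line or "nan" in line.lower():
>                 print("suspect value at line %d: %s" % (lineno, line.rstrip()))
>                 return False
>     return True
>
> if __name__ == "__main__":
>     sys.exit(0 if restart_is_clean(sys.argv[1]) else 1)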
>
> I can play the trajectories fine, so my input parameters are fine. It's
> just that along the way the written coordinates get messed up. This
> occurs with very high frequency (i.e. it is inevitable) in my dual-GPU
> setup. The problem happens very rarely (but still occurs) in single-GPU
> mode.
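>
> (And as a sanity check on the trajectories themselves, a sketch of
> scanning an AMBER NetCDF trajectory for non-finite coordinates. It
> assumes scipy is available; "coordinates" is the variable name from
> the AMBER NetCDF convention, and the filename is only an example.)
>
> #!/usr/bin/env python
> # Scan an AMBER NetCDF trajectory for frames containing NaN/Inf.
> import numpy as np
> from scipy.io import netcdf_file
>
> nc = netcdf_file("md_4.ncdf", "r", mmap=False)
> xyz = nc.variables["coordinates"]  # shape: (frames, atoms, 3)
> for i in range(xyz.shape[0]):
>     if not np.isfinite(np.asarray(xyz[i])).all():
>         print("frame %d contains NaN/Inf" % i)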
>
> What do you think?
>
>
> br,
> g
>
>
>
>
> On 20 June 2013 17:37, Ross Walker <ross.rosswalker.co.uk> wrote:
>
> > Hi ET,
> >
> > This cannot possibly be bandwidth limited. This error is triggered in the
> > CPU code (vanilla Fortran) long before any GPU calculations are fired up.
> > It is an initial check by the CPU at the time it is reading in the
> > coordinates. Have you tried this with the CPU version of PMEMD? What does
> > that report?
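> >
> > (For reference, the failing check amounts to something like the
> > following; this is a Python paraphrase built from the limits printed
> > in the error messages, not the actual Fortran source.)
> >
> > # Paraphrase of the start-up range check on the FFT grid sizes and
> > # box lengths read from the restart file.
> > def check_box(nfft1, nfft2, nfft3, a, b, c):
> >     errors = []
> >     for name, n in (("nfft1", nfft1), ("nfft2", nfft2), ("nfft3", nfft3)):
> >         if not 6 <= n <= 512:
> >             errors.append("%s must be in the range of 6 to 512!" % name)
> >     for name, x in (("a", a), ("b", b), ("c", c)):
> >         if not 1.0 <= x <= 1.0e4:
> >             errors.append("%s must be in the range of 1.0 to 1.0e4!" % name)
> >     return errors
> >
> > # A NaN box length fails its range test, and grid sizes derived from
> > # a corrupt box fail theirs, so all six errors fire at once.
> > print(check_box(0, 0, 0, float("nan"), float("nan"), float("nan")))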
> >
> > All the best
> > Ross
> >
> >
> >
> > On 6/20/13 7:56 AM, "ET" <sketchfoot.gmail.com> wrote:
> >
> > >Hi Scott,
> > >
> > >Hmmm. I'm thinking this is maybe a bandwidth issue. The problem occurs
> > >when running the cards in dual-GPU config, and the failure in
> > >single-GPU mode occurred on a machine with strangely low results.
> > >
> > >If I load the other card back into dual mode, I'll run the
> > >bandwidthTest again. The slots should both have x16 bandwidth even if
> > >both are populated. Do you think an unusual number of peripherals,
> > >such as HDDs, will make a difference?
> > >
> > >Also, you say that it is not the CUDA code or the GPU. This isn't an
> > >OS error, is it? Or is it AMBER reporting an underlying OS error?
> > >
> > >
> > >##########################################
> > >
> > >[CUDA Bandwidth Test] - Starting...
> > >Running on...
> > >
> > > Device 0: GeForce GTX TITAN
> > > Quick Mode
> > >
> > > Host to Device Bandwidth, 1 Device(s)
> > > PINNED Memory Transfers
> > > Transfer Size (Bytes) Bandwidth(MB/s)
> > > 33554432 3930.7
> > >
> > > Device to Host Bandwidth, 1 Device(s)
> > > PINNED Memory Transfers
> > > Transfer Size (Bytes) Bandwidth(MB/s)
> > > 33554432 2100.4
> > >
> > > Device to Device Bandwidth, 1 Device(s)
> > > PINNED Memory Transfers
> > > Transfer Size (Bytes) Bandwidth(MB/s)
> > > 33554432 220731.1
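> > >
> > >(For the dual-card comparison, a rough wrapper that runs the CUDA
> > >sample bandwidthTest once per device and flags low host<->device
> > >rates. It assumes bandwidthTest is on PATH, two devices, and that
> > >5000 MB/s is a reasonable floor for a PCIe 2.0 x16 slot with pinned
> > >memory.)
> > >
> > >#!/usr/bin/env python
> > ># Run bandwidthTest per device; flag HtoD/DtoH rates below a floor.
> > >import re
> > >import subprocess
> > >
> > >THRESHOLD_MBS = 5000.0
> > >for dev in (0, 1):
> > >    out = subprocess.check_output(
> > >        ["bandwidthTest", "--device=%d" % dev, "--memory=pinned"]).decode()
> > >    # Quick mode prints "<bytes>  <MB/s>" rows: HtoD, DtoH, then DtoD.
> > >    rates = [float(r) for r in re.findall(r"\d+\s+(\d+\.\d+)", out)]
> > >    low = [r for r in rates[:2] if r < THRESHOLD_MBS]
> > >    print("device %d: %s" % (dev, ("LOW %s" % low) if low else "ok"))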
> > >
> > >
> > >
> > >br,
> > >g
> > >
> > >
> > >On 19 June 2013 18:15, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >
> > >> | ERROR: nfft1 must be in the range of 6 to 512!
> > >> | ERROR: nfft2 must be in the range of 6 to 512!
> > >> | ERROR: nfft3 must be in the range of 6 to 512!
> > >> | ERROR: a must be in the range of 0.10000E+01 to 0.10000E+04!
> > >> | ERROR: b must be in the range of 0.10000E+01 to 0.10000E+04!
> > >> | ERROR: c must be in the range of 0.10000E+01 to 0.10000E+04!
> > >>
> > >> That's not the CUDA code or the GPU. That's something *very* bad with
> > >> your machine having somehow convinced itself of something even worse.
> > >> No idea what, though.
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Jun 17, 2013 at 10:24 PM, ET <sketchfoot.gmail.com> wrote:
> > >>
> > >> > PS: The machine is running in headless mode on CentOS 6.
> > >> >
> > >> >
> > >> > #### bandwidth test for currently installed TITAN-b:
> > >> >
> > >> > [CUDA Bandwidth Test] - Starting...
> > >> > Running on...
> > >> >
> > >> > Device 0: GeForce GTX TITAN
> > >> > Quick Mode
> > >> >
> > >> > Host to Device Bandwidth, 1 Device(s)
> > >> > PINNED Memory Transfers
> > >> > Transfer Size (Bytes) Bandwidth(MB/s)
> > >> > 33554432 6002.5
> > >> >
> > >> > Device to Host Bandwidth, 1 Device(s)
> > >> > PINNED Memory Transfers
> > >> > Transfer Size (Bytes) Bandwidth(MB/s)
> > >> > 33554432 6165.5
> > >> >
> > >> > Device to Device Bandwidth, 1 Device(s)
> > >> > PINNED Memory Transfers
> > >> > Transfer Size (Bytes) Bandwidth(MB/s)
> > >> > 33554432 220723.8
> > >> >
> > >> >
> > >> >
> > >> > ### deviceQuery
> > >> >
> > >> > deviceQuery Starting...
> > >> >
> > >> > CUDA Device Query (Runtime API) version (CUDART static linking)
> > >> >
> > >> > Detected 1 CUDA Capable device(s)
> > >> >
> > >> > Device 0: "GeForce GTX TITAN"
> > >> > CUDA Driver Version / Runtime Version 5.5 / 5.0
> > >> > CUDA Capability Major/Minor version number: 3.5
> > >> > Total amount of global memory: 6143 MBytes (6441730048 bytes)
> > >> > (14) Multiprocessors x (192) CUDA Cores/MP: 2688 CUDA Cores
> > >> > GPU Clock rate: 928 MHz (0.93 GHz)
> > >> > Memory Clock rate: 3004 Mhz
> > >> > Memory Bus Width: 384-bit
> > >> > L2 Cache Size: 1572864 bytes
> > >> > Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
> > >> > Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
> > >> > Total amount of constant memory: 65536 bytes
> > >> > Total amount of shared memory per block: 49152 bytes
> > >> > Total number of registers available per block: 65536
> > >> > Warp size: 32
> > >> > Maximum number of threads per multiprocessor: 2048
> > >> > Maximum number of threads per block: 1024
> > >> > Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> > >> > Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
> > >> > Maximum memory pitch: 2147483647 bytes
> > >> > Texture alignment: 512 bytes
> > >> > Concurrent copy and kernel execution: Yes with 1 copy engine(s)
> > >> > Run time limit on kernels: No
> > >> > Integrated GPU sharing Host Memory: No
> > >> > Support host page-locked memory mapping: Yes
> > >> > Alignment requirement for Surfaces: Yes
> > >> > Device has ECC support: Disabled
> > >> > Device supports Unified Addressing (UVA): Yes
> > >> > Device PCI Bus ID / PCI location ID: 3 / 0
> > >> > Compute Mode:
> > >> > < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
> > >> >
> > >> > deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX TITAN
> > >> >
> > >> >
> > >> >
> > >> > On 18 June 2013 06:21, ET <sketchfoot.gmail.com> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am trying to run NPT simulations with pmemd.cuda on TITAN
> > >> > > graphics cards. The equilibration steps were completed with the
> > >> > > CPU version of sander.
> > >> > >
> > >> > > I have 2x EVGA Superclocked TITAN cards. There have been problems
> > >> > > with the TITAN graphics cards, and I RMA'd one. I have benchmarked
> > >> > > both cards after the RMA and determined that they have no obvious
> > >> > > problems that would warrant them being RMA'd again. There is,
> > >> > > though, an issue with the AMBER CUDA code and TITANs in general,
> > >> > > as discussed in the following thread:
> > >> > >
> > >> > > < sorry, can't find it, but it's ~200 posts long and titled:
> > >> > > experiences with EVGA GTX TITAN Superclocked - memtestG80 -
> > >> > > UNDERclocking in Linux? >
> > >> > >
> > >> > >
> > >> > > As I'm not sure whether this is the same issue, I'm posting this
> > >> > > in a new thread.
> > >> > >
> > >> > > I began running 12 x 100 ns production runs using TITAN-a. There
> > >> > > were no problems. After waiting for and testing the replacement
> > >> > > card (TITAN-b), I put that into the machine as well, so both cards
> > >> > > were working on finishing the total of 300 segments.
> > >> > >
> > >> > > Very shortly, all the segments had failed, though the cards still
> > >> > > showed 100% utilisation; I did not realise until I checked the
> > >> > > outfiles, which showed "ERROR: nfft1 must be in the range of blah,
> > >> > > blah, blah" (error posted below). This was pretty weird, as I am
> > >> > > used to jobs failing visibly rather than carrying on eating
> > >> > > resources whilst doing nothing.
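> > >> > >
> > >> > > (Since the dead jobs keep showing 100% utilisation, here is a
> > >> > > minimal watchdog sketch that polls the outfiles for fatal ERROR
> > >> > > lines; the glob pattern and poll interval are made up for
> > >> > > illustration.)
> > >> > >
> > >> > > #!/usr/bin/env python
> > >> > > # Poll AMBER mdout files and report any that contain "ERROR:".
> > >> > > import glob
> > >> > > import time
> > >> > >
> > >> > > while True:
> > >> > >     for path in glob.glob("md_*.out"):
> > >> > >         with open(path) as fh:
> > >> > >             if any("ERROR:" in line for line in fh):
> > >> > >                 print("segment failed: %s" % path)
> > >> > >     time.sleep(600)  # check every 10 minutes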
> > >> > >
> > >> > > So I pulled TITAN-a out and restarted the calculations with
> > >> > > TITAN-b from the last good rst (usually two back). There have been
> > >> > > no problems at all, and all the simulations have completed.
> > >> > >
> > >> > > My hardware specs are:
> > >> > > Gigabyte GA-X58-UD7 mobo
> > >> > > i7-930 processor
> > >> > > 6GB RAM
> > >> > > 1200 Watt Bequiet power supply
> > >> > >
> > >> > >
> > >> > >
> > >> > > Does anyone have any idea as to what's going on?
> > >> > >
> > >> > >
> > >> > > br,
> > >> > > g
> > >> > >
> > >> > > ############################################################
> > >> > > ############################################################
> > >> > > -------------------------------------------------------
> > >> > > Amber 12 SANDER 2012
> > >> > > -------------------------------------------------------
> > >> > >
> > >> > > | PMEMD implementation of SANDER, Release 12
> > >> > >
> > >> > > | Run on 06/09/2013 at 16:26:10
> > >> > >
> > >> > > [-O]verwriting output
> > >> > >
> > >> > > File Assignments:
> > >> > > |   MDIN: prod.in
> > >> > > |  MDOUT: md_4.out
> > >> > > | INPCRD: md_3.rst
> > >> > > |   PARM: ../leap/TMC_I54V-V82S_Complex_25.parm
> > >> > > | RESTRT: md_4.rst
> > >> > > |   REFC: refc
> > >> > > |  MDVEL: mdvel
> > >> > > |   MDEN: mden
> > >> > > |  MDCRD: md_4.ncdf
> > >> > > | MDINFO: mdinfo
> > >> > >
> > >> > > Here is the input file:
> > >> > >
> > >> > > Constant pressure constant temperature production run
> > >> > > &cntrl
> > >> > >  nstlim=2000000, dt=0.002, ntx=5, irest=1, ntpr=250, ntwr=1000, ntwx=500,
> > >> > >  temp0=300.0, ntt=1, tautp=2.0, ioutfm=1, ig=-1, ntxo=2,
> > >> > >  ntb=2, ntp=1,
> > >> > >  ntc=2, ntf=2,
> > >> > >  nrespa=1,
> > >> > > &end
> > >> > >
> > >> > >
> > >> > >
> > >> > > Note: ig = -1. Setting random seed based on wallclock time in
> > >> > > microseconds.
> > >> > >
> > >> > > |--------------------- INFORMATION ----------------------
> > >> > > | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> > >> > > | Version 12.3
> > >> > > |
> > >> > > | 04/24/2013
> > >> > > |
> > >> > > | Implementation by:
> > >> > > | Ross C. Walker (SDSC)
> > >> > > | Scott Le Grand (nVIDIA)
> > >> > > | Duncan Poole (nVIDIA)
> > >> > > |
> > >> > > | CAUTION: The CUDA code is currently experimental.
> > >> > > | You use it at your own risk. Be sure to
> > >> > > | check ALL results carefully.
> > >> > > |
> > >> > > | Precision model in use:
> > >> > > | [SPFP] - Mixed Single/Double/Fixed Point Precision.
> > >> > > | (Default)
> > >> > > |
> > >> > > |--------------------------------------------------------
> > >> > >
> > >> > > |----------------- CITATION INFORMATION -----------------
> > >> > > |
> > >> > > | When publishing work that utilized the CUDA version
> > >> > > | of AMBER, please cite the following in addition to
> > >> > > | the regular AMBER citations:
> > >> > > |
> > >> > > | - Romelia Salomon-Ferrer; Andreas W. Goetz; Duncan
> > >> > > | Poole; Scott Le Grand; Ross C. Walker "Routine
> > >> > > | microsecond molecular dynamics simulations with
> > >> > > | AMBER - Part II: Particle Mesh Ewald", J. Chem.
> > >> > > | Theory Comput., 2012, (In review).
> > >> > > |
> > >> > > | - Andreas W. Goetz; Mark J. Williamson; Dong Xu;
> > >> > > | Duncan Poole; Scott Le Grand; Ross C. Walker
> > >> > > | "Routine microsecond molecular dynamics simulations
> > >> > > | with AMBER - Part I: Generalized Born", J. Chem.
> > >> > > | Theory Comput., 2012, 8 (5), pp1542-1555.
> > >> > > |
> > >> > > | - Scott Le Grand; Andreas W. Goetz; Ross C. Walker
> > >> > > | "SPFP: Speed without compromise - a mixed precision
> > >> > > | model for GPU accelerated molecular dynamics
> > >> > > | simulations.", Comp. Phys. Comm., 2013, 184
> > >> > > | pp374-380, DOI: 10.1016/j.cpc.2012.09.022
> > >> > > |
> > >> > > |--------------------------------------------------------
> > >> > >
> > >> > > |------------------- GPU DEVICE INFO --------------------
> > >> > > |
> > >> > > | CUDA Capable Devices Detected: 2
> > >> > > | CUDA Device ID in use: 0
> > >> > > | CUDA Device Name: GeForce GTX TITAN
> > >> > > | CUDA Device Global Mem Size: 6143 MB
> > >> > > | CUDA Device Num Multiprocessors: 14
> > >> > > | CUDA Device Core Freq: 0.93 GHz
> > >> > > |
> > >> > > |--------------------------------------------------------
> > >> > >
> > >> > > | ERROR: nfft1 must be in the range of 6 to 512!
> > >> > > | ERROR: nfft2 must be in the range of 6 to 512!
> > >> > > | ERROR: nfft3 must be in the range of 6 to 512!
> > >> > > | ERROR: a must be in the range of 0.10000E+01 to 0.10000E+04!
> > >> > > | ERROR: b must be in the range of 0.10000E+01 to 0.10000E+04!
> > >> > > | ERROR: c must be in the range of 0.10000E+01 to 0.10000E+04!
> > >> > >
> > >> > > Input errors occurred. Terminating execution.
> > >> > > ############################################################
> > >> > > ############################################################
> > >> > >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 20 2013 - 12:30:02 PDT