Re: [AMBER] REMD jobs crashing repeatedly on GPUs

From: Scott Brozell <sbrozell.rci.rutgers.edu>
Date: Fri, 27 Feb 2015 12:33:03 -0500

Hi,

What was the story regarding this old thread?

This uncommon GPU error has been reported again:
Running Steered MD calculations (a 20-sample run) with AMBER14
using pmemd.cuda.MPI failed after a few steps:

   cudaMemcpy GpuBuffer::Upload failed unknown error
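
For context, messages of this form come from checking the return code of a
host-to-device copy. The sketch below is hypothetical (not pmemd's actual
source) and only illustrates the pattern; "unknown error" (cudaErrorUnknown)
at such a check usually reflects a fault left behind by an earlier
asynchronous kernel launch or a sick board rather than the copy call itself.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;               // 1 MB test buffer
    float* host = (float*)calloc(1, bytes);
    float* dev  = NULL;

    cudaError_t status = cudaMalloc((void**)&dev, bytes);
    if (status != cudaSuccess) {
        printf("cudaMalloc GpuBuffer::Allocate failed %s\n",
               cudaGetErrorString(status));
        return -1;
    }

    // Host-to-device upload.  An "unknown error" here usually reports a fault
    // left over from an earlier asynchronous kernel launch or failing
    // hardware, not a problem with the copy arguments themselves.
    status = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (status != cudaSuccess) {
        printf("cudaMemcpy GpuBuffer::Upload failed %s\n",
               cudaGetErrorString(status));
        return -1;
    }

    cudaFree(dev);
    free(host);
    return 0;
}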


/usr/local/cuda/5.0.35/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla M2070"
  CUDA Driver Version / Runtime Version 6.5 / 5.0
  CUDA Capability Major/Minor version number: 2.0
  Total amount of global memory: 5375 MBytes (5636554752 bytes)
  (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
  GPU Clock rate: 1147 MHz (1.15 GHz)
  Memory Clock rate: 1566 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 786432 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Enabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 20 / 0
  Compute Mode:
     < Exclusive (only one host thread in one process is able to use ::cudaSetDevice() with this device) >

Device 1: "Tesla M2070"
  CUDA Driver Version / Runtime Version 6.5 / 5.0
  CUDA Capability Major/Minor version number: 2.0
  Total amount of global memory: 5375 MBytes (5636554752 bytes)
  (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
  GPU Clock rate: 1147 MHz (1.15 GHz)
  Memory Clock rate: 1566 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 786432 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 1536
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Enabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 17 / 0
  Compute Mode:
     < Exclusive (only one host thread in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla M2070, Device1 = Tesla M2070
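
For reference, the properties reported above that matter most here (compute
mode and ECC state) can be checked programmatically with the CUDA runtime
API. This is a minimal sketch of roughly what deviceQuery does, not the
deviceQuery source itself:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return -1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // computeMode 3 corresponds to exclusive-process mode on recent toolkits.
        printf("Device %d: %s, SM %d.%d, ECC %s, compute mode %d\n",
               dev, prop.name, prop.major, prop.minor,
               prop.ECCEnabled ? "on" : "off", prop.computeMode);
    }
    return 0;
}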

On Fri, Sep 06, 2013 at 11:13:27AM -0700, Scott Le Grand wrote:
> One flaky GPU in the mix is more than enough to trigger this.
>
> > On Fri, Sep 6, 2013 at 11:07 AM, Christina Bergonzo <cbergonzo.gmail.com> wrote:
> > I saw similar problems and have been working with someone from Keeneland to
> > sort it out. I forwarded this to him.
> > It may be a Keeneland issue and not a problem with the code since we're
> > able to run this elsewhere.
> >
> > -Christina
> >
> >
> > On Fri, Sep 6, 2013 at 11:57 AM, Joe Passman <joepassman.comcast.net> wrote:
> > > I am running a series of replica exchange jobs using AMBER12 on
> > Keeneland.
> > > The jobs crash repeatedly. The error ALWAYS comes at different times
> > during
> > > the simulation. I am not sure what the issue is.
> > >
> > >
> > > Here is an excerpt of an example output file:
> > >
> > > ===================
> > > started run 240
> > >
> > > Running multipmemd version of pmemd Amber12
> > > Total processors = 30
> > > Number of groups = 30
> > >
> > >
> > > Running multipmemd version of pmemd Amber12
> > > Total processors = 30
> > > Number of groups = 30
> > >
> > >
> > > Running multipmemd version of pmemd Amber12
> > > Total processors = 30
> > > Number of groups = 30
> > >
> > > started run 241
> > >
> > > Running multipmemd version of pmemd Amber12
> > > Total processors = 30
> > > Number of groups = 30
> > >
> > > UNKNOWN
> > > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > > UNKNOWN
> > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > UNKNOWN
> > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > UNKNOWN
> > > cudaMalloc GpuBuffer::Allocate failed unknown error
> > > UNKNOWN
> > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > UNKNOWN
> > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > UNKNOWN
> > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > UNKNOWN
> > > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > > UNKNOWN
> > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > UNKNOWN
> > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > UNKNOWN
> > > Error: unknown error launching kernel kUpdate
> > > UNKNOWN
> > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > UNKNOWN
> > > Error: unknown error launching kernel kUpdate
> > > UNKNOWN
> > > gpu_download_partial_forces: download failed unknown error
> > > UNKNOWN
> > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > UNKNOWN
> > > cudaMalloc GpuBuffer::Allocate failed unknown error
> > > UNKNOWN
> > > UNKNOWN
> > > Error: unknown error launching kernel kUpdate
> > >
> > > Running multipmemd version of pmemd Amber12
> > > Total processors = 30
> > > Number of groups = 30
> > >
> > > UNKNOWN
> > > UNKNOWN
> > > UNKNOWN
> > > ===================
> > >
> > >
> > > I have been talking to Shiquan Su at Keeneland. He found
> > > this seemingly relevant thread from 2010.
> > >
> > > http://archive.ambermd.org/201009/0180.html
> > >
> > >
> > > Does anyone have an idea what is happening here?

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 27 2015 - 10:00:02 PST