Re: [AMBER] REMD jobs crashing repeatedly on GPUs

From: Christina Bergonzo <cbergonzo.gmail.com>
Date: Wed, 4 Mar 2015 11:19:35 -0700

Hi,

When running M-REMD on multiple GPUs with a pre-release Amber14 that was
compiled using the Intel compilers + CUDA 5.0, I would see random crashes,
even though the test cases all passed and the system ran fine on other
machines (Blue Waters).
As far as I can tell, Scott Le Grand determined that it was due to an
Intel compiler bug. The errors went away after rebuilding with the patched
Amber14 + GCC compilers + CUDA 5.0.
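
In case it is useful, here is a minimal sketch of that kind of rebuild with
the GNU toolchain (assuming $AMBERHOME points at a patched Amber14 tree and
that the nvcc on your PATH is from CUDA 5.0; module names and the MPI launch
command will differ on your cluster):

  cd $AMBERHOME
  ./configure -cuda -mpi gnu         # GNU compilers instead of "intel"
  make install                       # rebuilds pmemd.cuda.MPI
  export DO_PARALLEL="mpirun -np 2"  # adjust for your MPI/queue setup
  make test                          # rerun the test suite against the new build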

-Christina

On Fri, Feb 27, 2015 at 10:33 AM, Scott Brozell <sbrozell.rci.rutgers.edu>
wrote:

> Hi,
>
> What was the story regarding this old thread ?
>
> This uncommon gpu error has been reported again:
> Running Steered MD calculations (a 20-sample run) with AMBER14
> using pmemd.cuda.MPI failed after a few steps:
>
> cudaMemcpy GpuBuffer::Upload failed unknown error
>
>
> /usr/local/cuda/5.0.35/1_Utilities/deviceQuery/deviceQuery Starting...
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
>
> Detected 2 CUDA Capable device(s)
>
> Device 0: "Tesla M2070"
> CUDA Driver Version / Runtime Version 6.5 / 5.0
> CUDA Capability Major/Minor version number: 2.0
> Total amount of global memory: 5375 MBytes (5636554752 bytes)
> (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
> GPU Clock rate: 1147 MHz (1.15 GHz)
> Memory Clock rate: 1566 Mhz
> Memory Bus Width: 384-bit
> L2 Cache Size: 786432 bytes
> Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
> Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 32768
> Warp size: 32
> Maximum number of threads per multiprocessor: 1536
> Maximum number of threads per block: 1024
> Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
> Run time limit on kernels: No
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Enabled
> Device supports Unified Addressing (UVA): Yes
> Device PCI Bus ID / PCI location ID: 20 / 0
> Compute Mode:
> < Exclusive (only one host thread in one process is able to use
> ::cudaSetDevice() with this device) >
>
> Device 1: "Tesla M2070"
> CUDA Driver Version / Runtime Version 6.5 / 5.0
> CUDA Capability Major/Minor version number: 2.0
> Total amount of global memory: 5375 MBytes (5636554752 bytes)
> (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores
> GPU Clock rate: 1147 MHz (1.15 GHz)
> Memory Clock rate: 1566 Mhz
> Memory Bus Width: 384-bit
> L2 Cache Size: 786432 bytes
> Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
> Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 32768
> Warp size: 32
> Maximum number of threads per multiprocessor: 1536
> Maximum number of threads per block: 1024
> Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 2 copy engine(s)
> Run time limit on kernels: No
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Enabled
> Device supports Unified Addressing (UVA): Yes
> Device PCI Bus ID / PCI location ID: 17 / 0
> Compute Mode:
> < Exclusive (only one host thread in one process is able to use
> ::cudaSetDevice() with this device) >
>
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime
> Version = 5.0, NumDevs = 2, Device0 = Tesla M2070, Device1 = Tesla M2070
>
> On Fri, Sep 06, 2013 at 11:13:27AM -0700, Scott Le Grand wrote:
> > One flaky GPU in the mix is more than enough to trigger this.
> >
> > On Fri, Sep 6, 2013 at 11:07 AM, Christina Bergonzo <cbergonzo.gmail.com>
> > wrote:
> > > I saw similar problems and have been working with someone from
> Keeneland to
> > > sort it out. I forwarded this to him.
> > > It may be a Keeneland issue and not a problem with the code since we're
> > > able to run this elsewhere.
> > >
> > > -Christina
> > >
> > >
> > > On Fri, Sep 6, 2013 at 11:57 AM, Joe Passman <joepassman.comcast.net>
> > > wrote:
> > > > I am running a series of replica exchange jobs using AMBER12 on
> > > Keeneland.
> > > > The jobs crash repeatedly. The error ALWAYS comes at different times
> > > during
> > > > the simulation. I am not sure what the issue is.
> > > >
> > > >
> > > > Here is an excerpt of an example output file:
> > > >
> > > > ===================
> > > > started run 240
> > > >
> > > > Running multipmemd version of pmemd Amber12
> > > > Total processors = 30
> > > > Number of groups = 30
> > > >
> > > >
> > > > Running multipmemd version of pmemd Amber12
> > > > Total processors = 30
> > > > Number of groups = 30
> > > >
> > > >
> > > > Running multipmemd version of pmemd Amber12
> > > > Total processors = 30
> > > > Number of groups = 30
> > > >
> > > > started run 241
> > > >
> > > > Running multipmemd version of pmemd Amber12
> > > > Total processors = 30
> > > > Number of groups = 30
> > > >
> > > > UNKNOWN
> > > > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > > UNKNOWN
> > > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > > UNKNOWN
> > > > cudaMalloc GpuBuffer::Allocate failed unknown error
> > > > UNKNOWN
> > > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > > UNKNOWN
> > > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > > UNKNOWN
> > > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > > UNKNOWN
> > > > cudaMemcpyToSymbol: SetSim copy to cSim failed unknown error
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kUpdate
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kCalculatePMENonbondForces
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kUpdate
> > > > UNKNOWN
> > > > gpu_download_partial_forces: download failed unknown error
> > > > UNKNOWN
> > > > cudaMemcpy GpuBuffer::Upload failed unknown error
> > > > UNKNOWN
> > > > cudaMalloc GpuBuffer::Allocate failed unknown error
> > > > UNKNOWN
> > > > UNKNOWN
> > > > Error: unknown error launching kernel kUpdate
> > > >
> > > > Running multipmemd version of pmemd Amber12
> > > > Total processors = 30
> > > > Number of groups = 30
> > > >
> > > > UNKNOWN
> > > > UNKNOWN
> > > > UNKNOWN
> > > > ===================
> > > >
> > > >
> > > > I have been talking to Shiquan Su at Keeneland. He found
> > > > this seemingly relevant thread from 2010:
> > > >
> > > > http://archive.ambermd.org/201009/0180.html
> > > >
> > > >
> > > > Does anyone have an idea what is happening here?
>



-- 
---------------------------------------------------------------------------------------
Christina Bergonzo, PhD
Postdoctoral Researcher
Department of Medicinal Chemistry, University of Utah
30 South 2000 East, Rm. 201
Salt Lake City, UT 84112-5820
Office: L.S. Skaggs Pharmacy Research Institute, Rm.4290
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652 / Fax: (801) 585-6208
---------------------------------------------------------------------------------------
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 04 2015 - 10:30:02 PST