Re: [AMBER] Problem running multiple GPU's

From: <jon.maguire.louisville.edu>
Date: Wed, 10 Sep 2014 16:46:33 +0000

One of the three cards is powering a display, although it doesn't really need to. Regardless, here is the output from deviceQuery

Detected 3 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 MHz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 MHz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 MHz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 5.0, NumDevs = 3, Device0 = GeForce GTX TITAN Black, Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN Black



________________________________________
From: Jason Swails [jason.swails.gmail.com]
Sent: Wednesday, September 10, 2014 12:45 PM
To: amber.ambermd.org
Subject: Re: [AMBER] Problem running multiple GPU's

On Wed, 2014-09-10 at 15:53 +0000, jon.maguire.louisville.edu wrote:
> We’ve built a system that has 3 Nvidia Titan Blacks. We CAN run pmemd.cuda (and the MPI version) in the following configs
>
> export CUDA_VISIBLE_DEVICES=0
> export CUDA_VISIBLE_DEVICES=0,1
> export CUDA_VISIBLE_DEVICES=0,2
>
> However, we CANNOT run the following:
>
> export CUDA_VISIBLE_DEVICES=1
> export CUDA_VISIBLE_DEVICES=2
> export CUDA_VISIBLE_DEVICES=1,2
>
> We want to run one job per GPU, but Amber comes back with “Error
> selecting compatible GPU out of memory” when nothing is running on the
> GPU. Or, in the case of running on 1,2, it returns
> “cudaMemcpyToSymbol: SetSim copy to cSim failed out of memory." Is
> there a flag that needs to be set? An nvidia-smi command? It's really
> bizarre behavior!

What happens when you run deviceQuery from the CUDA code samples? Do
you see all 3 GPUs?

It's important to note that the GPU ordering printed by nvidia-smi is
NOT always the same as the ordering the CUDA runtime sees. To get the
true device ID -> card mapping, you need to use a program that actually
uses the CUDA API (e.g., deviceQuery).
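
If you don't have the SDK samples handy, a minimal stand-in for
deviceQuery can be sketched as below. This is an assumption-free use of
the standard CUDA runtime API (cudaGetDeviceCount,
cudaGetDeviceProperties); it is a sketch, not AMBER code, and it needs a
CUDA toolkit and GPU to build and run (e.g. `nvcc -o devorder devorder.cu`):

```c
/* Print the GPUs exactly as the CUDA runtime orders them, with the PCI
 * bus/device IDs that let you match each entry against nvidia-smi. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        /* The PCI bus ID is the stable key for cross-referencing the
         * nvidia-smi listing, whose ordering may differ. */
        printf("runtime device %d: %s  (PCI bus %d, device %d)\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

The PCI bus ID column is what you compare against `nvidia-smi`, since
the two tools may number the same cards differently.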

Could it be that you have 4 GPUs on your machine, with one powering the
display, and that the 4th GPU won't work for Amber? In any case, the
output of deviceQuery will tell us what the CUDA runtime expects in
terms of available GPUs and their properties.
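
Once the runtime's device numbering is confirmed, the one-job-per-GPU
setup you describe can be sketched as below. The pmemd.cuda input and
output file names here are assumptions, not your actual files:

```shell
# Launch one independent pmemd.cuda job per runtime device ID.
# CUDA_VISIBLE_DEVICES restricts each process to a single card,
# which that process then sees as device 0.
for gpu in 0 1 2; do
  CUDA_VISIBLE_DEVICES=$gpu pmemd.cuda -O -i md.in -p prmtop -c inpcrd \
      -o md.gpu$gpu.out -r md.gpu$gpu.rst -x md.gpu$gpu.nc &
done
wait
```

Note that the device IDs in CUDA_VISIBLE_DEVICES are the runtime's
(deviceQuery's) IDs, which is why confirming the mapping first matters.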

HTH,
Jason

--
Jason M. Swails
Postdoctoral Researcher
BioMaPS, Rutgers University
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 10 2014 - 10:00:03 PDT