One of the three cards is powering a display, although it doesn't really need to. Regardless, here is the output from deviceQuery:
Detected 3 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 6143 MBytes (6441730048 bytes)
  (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
  GPU Clock rate: 980 MHz (0.98 GHz)
  Memory Clock rate: 3500 Mhz
  Memory Bus Width: 384-bit
  L2 Cache Size: 1572864 bytes
  Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device PCI Bus ID / PCI location ID: 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 5.0, NumDevs = 3, Device0 = GeForce GTX TITAN Black, Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN Black
________________________________________
From: Jason Swails [jason.swails@gmail.com]
Sent: Wednesday, September 10, 2014 12:45 PM
To: amber@ambermd.org
Subject: Re: [AMBER] Problem running multiple GPU's
On Wed, 2014-09-10 at 15:53 +0000, jon.maguire@louisville.edu wrote:
> We’ve built a system that has 3 Nvidia Titan Blacks. We CAN run pmemd.cuda (and the MPI version) in the following configs
>
> export CUDA_VISIBLE_DEVICES=0
> export CUDA_VISIBLE_DEVICES=0,1
> export CUDA_VISIBLE_DEVICES=0,2
>
> However, we CANNOT run the following:
>
> export CUDA_VISIBLE_DEVICES=1
> export CUDA_VISIBLE_DEVICES=2
> export CUDA_VISIBLE_DEVICES=1,2
>
> We want to run one job per GPU, but Amber comes back with “Error
> selecting compatible GPU out of memory” when nothing is running on the
> GPU. Or in the case of running on 1,2, it returns
> “cudaMemcpyToSymbol: SetSim copy to cSim failed out of memory." Is
> there a flag that needs to be set? An nvidia-smi command? It's really
> bizarre behavior!
What happens when you run deviceQuery from the CUDA code samples? Do
you see all 3 GPUs?
It's important to note that the GPU ordering printed by nvidia-smi is
NOT always the same ordering as what the CUDA runtime sees. In order to
get the true device ID -> card mapping, you need to use a program that
actually uses the CUDA API (e.g., deviceQuery).
Could it be that you have 4 GPUs in your machine, with one powering the
display, and that the 4th GPU won't work for Amber? In any case, the
output of deviceQuery will tell us what the CUDA runtime expects in
terms of available GPUs and their properties.
HTH,
Jason
--
Jason M. Swails
BioMaPS, Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER@ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 10 2014 - 10:00:03 PDT