Re: [AMBER] Problem running multiple GPU's

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 10 Sep 2014 10:03:10 -0700

Hi Jon,

Weird indeed - I would concentrate on a single GPU for now and get that
working properly before trying multi-GPU. But let's try the following:

First please download the following:
http://ambermd.org/gpus/Amber14_GPU_Benchmark_Suite.tar.bz2

untar it and then edit the file run_bench_CPU+GPU.sh

Set the following at the top of the file:
#------ SET FOR YOUR SYSTEM -------
#
GPU_COUNT=3
CPU_COUNT=0
#
#----------------------------------

Then run the following:

unset CUDA_VISIBLE_DEVICES
init 3
./run_bench_CPU+GPU.sh >& run_bench_CPU+GPU.log &
tail -f run_bench_CPU+GPU.log

Once it completes let me know what the log file says.
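On your original one-job-per-GPU goal: once single-GPU runs behave, the
usual pattern is to pin each job to one card with CUDA_VISIBLE_DEVICES so
every process sees "its" card as device 0. A minimal sketch - the echo is
just a stand-in for your actual pmemd.cuda command line, and the log file
names are placeholders:

```shell
# Launch one independent job per GPU. The per-process environment
# variable means each job sees only its own card (as device 0).
# Replace the echo with your real pmemd.cuda invocation.
for gpu in 0 1 2; do
  CUDA_VISIBLE_DEVICES=$gpu bash -c \
    'echo "running on GPU $CUDA_VISIBLE_DEVICES"' > job_gpu${gpu}.log 2>&1 &
done
wait
```

Each job writes its own log, and no two jobs compete for the same card.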

Note: for parallel runs with 3 GPUs in your system (I am assuming it is
dual socket here?) you will likely have 2 GPUs on one socket and 1 on the
other. You will only be able to run in parallel - with peer to peer -
across the two GPUs that are connected to the same CPU socket; trying
other combinations will disable peer to peer and you will end up with
performance slower than a single GPU run. Take a look at
http://ambermd.org/gpus/#Running for details on how to test peer-to-peer
support within your system. If your machine is only single socket, and
without an 8780 PCI-E switch (I know of no motherboards shipping with
this onboard right now), you will likely have one or more of your GPUs
demoted to x8 PCI-E performance, which rules out any speedup for parallel
PME runs (GB might still scale).
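As a quick way to guess which cards hang off which socket, you can pull
the PCI bus IDs out of a saved deviceQuery log with a one-line awk sketch
(rough heuristic only - the peer-to-peer test on the page above is the
definitive check). The here-doc below just replays the bus ID lines from
your deviceQuery output; normally you would feed it the saved log file
instead:

```shell
# Save your real output first with:  ./deviceQuery > deviceQuery.log
# then run the awk on that file. Bus IDs that are close together
# (e.g. 3 and 4 here) usually sit on the same PCI-E root / CPU socket.
awk '/PCI Bus ID/ {print $(NF-2)}' <<'EOF'
  Device PCI Bus ID / PCI location ID:           4 / 0
  Device PCI Bus ID / PCI location ID:           8 / 0
  Device PCI Bus ID / PCI location ID:           3 / 0
EOF
```

On your box that prints 4, 8 and 3, which suggests devices 0 and 2
(buses 4 and 3) share a root while device 1 (bus 8) is on the other
socket - but again, confirm with the actual peer-to-peer check.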

All the best
Ross



On 9/10/14, 9:46 AM, "jon.maguire.louisville.edu"
<jon.maguire.louisville.edu> wrote:

>One of the three cards is powering a display, although it doesn't really
>need to. Regardless, here is the output from deviceQuery
>
>Detected 3 CUDA Capable device(s)
>
>Device 0: "GeForce GTX TITAN Black"
> CUDA Driver Version / Runtime Version 6.0 / 5.0
> CUDA Capability Major/Minor version number: 3.5
> Total amount of global memory: 6143 MBytes (6441730048
>bytes)
> (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
> GPU Clock rate: 980 MHz (0.98 GHz)
> Memory Clock rate: 3500 Mhz
> Memory Bus Width: 384-bit
> L2 Cache Size: 1572864 bytes
> Max Texture Dimension Size (x,y,z) 1D=(65536),
>2D=(65536,65536), 3D=(4096,4096,4096)
> Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
>2D=(16384,16384) x 2048
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 65536
> Warp size: 32
> Maximum number of threads per multiprocessor: 2048
> Maximum number of threads per block: 1024
> Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> Maximum sizes of each dimension of a grid: 2147483647 x 65535 x
>65535
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 1 copy engine(s)
> Run time limit on kernels: No
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Disabled
> Device supports Unified Addressing (UVA): Yes
> Device PCI Bus ID / PCI location ID: 4 / 0
> Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with
>device simultaneously) >
>
>Device 1: "GeForce GTX TITAN Black"
> CUDA Driver Version / Runtime Version 6.0 / 5.0
> CUDA Capability Major/Minor version number: 3.5
> Total amount of global memory: 6143 MBytes (6441730048
>bytes)
> (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
> GPU Clock rate: 980 MHz (0.98 GHz)
> Memory Clock rate: 3500 Mhz
> Memory Bus Width: 384-bit
> L2 Cache Size: 1572864 bytes
> Max Texture Dimension Size (x,y,z) 1D=(65536),
>2D=(65536,65536), 3D=(4096,4096,4096)
> Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
>2D=(16384,16384) x 2048
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 65536
> Warp size: 32
> Maximum number of threads per multiprocessor: 2048
> Maximum number of threads per block: 1024
> Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> Maximum sizes of each dimension of a grid: 2147483647 x 65535 x
>65535
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 1 copy engine(s)
> Run time limit on kernels: No
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Disabled
> Device supports Unified Addressing (UVA): Yes
> Device PCI Bus ID / PCI location ID: 8 / 0
> Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with
>device simultaneously) >
>
>Device 2: "GeForce GTX TITAN Black"
> CUDA Driver Version / Runtime Version 6.0 / 5.0
> CUDA Capability Major/Minor version number: 3.5
> Total amount of global memory: 6143 MBytes (6441730048
>bytes)
> (15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
> GPU Clock rate: 980 MHz (0.98 GHz)
> Memory Clock rate: 3500 Mhz
> Memory Bus Width: 384-bit
> L2 Cache Size: 1572864 bytes
> Max Texture Dimension Size (x,y,z) 1D=(65536),
>2D=(65536,65536), 3D=(4096,4096,4096)
> Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
>2D=(16384,16384) x 2048
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 49152 bytes
> Total number of registers available per block: 65536
> Warp size: 32
> Maximum number of threads per multiprocessor: 2048
> Maximum number of threads per block: 1024
> Maximum sizes of each dimension of a block: 1024 x 1024 x 64
> Maximum sizes of each dimension of a grid: 2147483647 x 65535 x
>65535
> Maximum memory pitch: 2147483647 bytes
> Texture alignment: 512 bytes
> Concurrent copy and kernel execution: Yes with 1 copy engine(s)
> Run time limit on kernels: Yes
> Integrated GPU sharing Host Memory: No
> Support host page-locked memory mapping: Yes
> Alignment requirement for Surfaces: Yes
> Device has ECC support: Disabled
> Device supports Unified Addressing (UVA): Yes
> Device PCI Bus ID / PCI location ID: 3 / 0
> Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with
>device simultaneously) >
>
>deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA
>Runtime Version = 5.0, NumDevs = 3, Device0 = GeForce GTX TITAN Black,
>Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN Black
>
>
>
>________________________________________
>From: Jason Swails [jason.swails.gmail.com]
>Sent: Wednesday, September 10, 2014 12:45 PM
>To: amber.ambermd.org
>Subject: Re: [AMBER] Problem running multiple GPU's
>
>On Wed, 2014-09-10 at 15:53 +0000, jon.maguire.louisville.edu wrote:
>> We've built a system that has 3 Nvidia Titan Blacks. We CAN run
>>pmemd.cuda (and the MPI version) in the following configs
>>
>> export CUDA_VISIBLE_DEVICES=0
>> export CUDA_VISIBLE_DEVICES=0,1
>> export CUDA_VISIBLE_DEVICES=0,2
>>
>> However, we CANNOT run the following:
>>
>> export CUDA_VISIBLE_DEVICES=1
>> export CUDA_VISIBLE_DEVICES=2
>> export CUDA_VISIBLE_DEVICES=1,2
>>
>> We want to run one job per GPU, but amber comes back with "Error
>> selecting compatible GPU out of memory" when nothing is running on the
>> GPU. Or in the case of running on 1,2, it returns
>> "cudaMemcpyToSymbol: SetSim copy to cSim failed out of memory." Is
>> there a flag that needs to be set? An nvidia-smi command? It's really
>> bizarre behavior!
>
>What happens when you run deviceQuery from the CUDA code samples? Do
>you see all 3 GPUs?
>
>It's important to note that the GPU ordering printed by nvidia-smi is
>NOT always the same ordering as what the CUDA runtime sees. In order to
>get the true device ID -> card mapping, you need to use a program that
>actually uses the CUDA API (e.g., deviceQuery).
>
>It could be that you have 4 GPUs on your machine with one powering the
>display? And that 4th GPU won't work for Amber? In any case, the
>output of deviceQuery will tell us what the CUDA RT expects in terms of
>available GPUs and their properties.
>
>HTH,
>Jason
>
>--
>Jason M. Swails
>BioMaPS,
>Rutgers University
>Postdoctoral Researcher
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 10 2014 - 10:30:02 PDT