Re: [AMBER] Problem running multiple GPU's - FIXED

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 10 Sep 2014 11:06:10 -0700

Hi Jon,

Thanks for the update - glad you found the problem. I'll add a note about
this to the AMBER GPU website.

All the best
Ross



On 9/10/14, 10:56 AM, "jon.maguire.louisville.edu"
<jon.maguire.louisville.edu> wrote:

>I was able to find the root of the problem. So here is one thing to
>check if you happen to be running a multi-GPU box at runlevel 5 (GUI):
>
>Make sure the options for SLI and MultiGPU are set to "off" within the
>xorg.conf file.
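>
>For reference, the relevant stanza looks something like this (the
>Identifier names here are placeholders from a typical nvidia-xconfig
>setup - check your own xorg.conf):
>
>Section "Screen"
>    Identifier "Screen0"
>    Device     "Device0"
>    Option     "SLI" "Off"
>    Option     "MultiGPU" "Off"
>EndSection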
>
>If those are on, everything depends on device 0 to run. Everything
>is back to normal now - thank you all for the help.
>
>-Jon
>
>On Sep 10, 2014, at 1:03 PM, Ross Walker
><ross.rosswalker.co.uk> wrote:
>
>Hi Jon,
>
>Weird indeed - I would concentrate on a single GPU for now and get that
>working properly before trying multi-GPU. But let's try the following:
>
>First please download the following:
>http://ambermd.org/gpus/Amber14_GPU_Benchmark_Suite.tar.bz2
>
>Untar it, then edit the file run_bench_CPU+GPU.sh.
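>
>For example (assuming wget is available):
>
>wget http://ambermd.org/gpus/Amber14_GPU_Benchmark_Suite.tar.bz2
>tar xjf Amber14_GPU_Benchmark_Suite.tar.bz2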
>
>Set the following at the top of the file:
>#------ SET FOR YOUR SYSTEM -------
>#
>GPU_COUNT=3
>CPU_COUNT=0
>#
>#----------------------------------
>
>Then run the following:
>
>unset CUDA_VISIBLE_DEVICES    # make all three GPUs visible again
>init 3                        # drop to runlevel 3 so X releases the GPUs
>./run_bench_CPU+GPU.sh >& run_bench_CPU+GPU.log &
>tail -f run_bench_CPU+GPU.log
>
>Once it completes let me know what the log file says.
>
>Note: for parallel runs with 3 GPUs in your system (I am assuming it is
>dual socket here?), you will likely have 2 GPUs on one socket and 1 on
>the other. You will only be able to run in parallel - with peer to
>peer - across the two GPUs that are connected to the same CPU socket;
>trying other combinations will disable peer to peer and you will end up
>with performance slower than a single GPU run. Take a look at
>http://ambermd.org/gpus/#Running for details on how to test peer to
>peer support within your system. If your machine is only single socket,
>and without an 8780 PCI-E switch (I know of no motherboards shipping
>with this onboard right now), you will likely have one or more of your
>GPUs demoted to x8 PCI-E performance, which rules out any speedup for
>parallel PME runs (GB might still scale).
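>
>As a quick check, the simpleP2P program from the CUDA code samples will
>report whether peer access works between each pair of visible devices.
>The path below assumes a default CUDA 6.0 samples install and may
>differ on your machine:
>
>cd ~/NVIDIA_CUDA-6.0_Samples/0_Simple/simpleP2P
>make
>./simpleP2P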
>
>All the best
>Ross
>
>
>
>On 9/10/14, 9:46 AM,
>"jon.maguire.louisville.edu<mailto:jon.maguire.louisville.edu>"
><jon.maguire.louisville.edu<mailto:jon.maguire.louisville.edu>> wrote:
>
>One of the three cards is powering a display, although it doesn't really
>need to. Regardless, here is the output from deviceQuery:
>
>Detected 3 CUDA Capable device(s)
>
>Device 0: "GeForce GTX TITAN Black"
>CUDA Driver Version / Runtime Version 6.0 / 5.0
>CUDA Capability Major/Minor version number: 3.5
>Total amount of global memory: 6143 MBytes (6441730048 bytes)
>(15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
>GPU Clock rate: 980 MHz (0.98 GHz)
>Memory Clock rate: 3500 MHz
>Memory Bus Width: 384-bit
>L2 Cache Size: 1572864 bytes
>Max Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
>Max Layered Texture Size (dim) x layers: 1D=(16384) x 2048, 2D=(16384,16384) x 2048
>Total amount of constant memory: 65536 bytes
>Total amount of shared memory per block: 49152 bytes
>Total number of registers available per block: 65536
>Warp size: 32
>Maximum number of threads per multiprocessor: 2048
>Maximum number of threads per block: 1024
>Maximum sizes of each dimension of a block: 1024 x 1024 x 64
>Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
>Maximum memory pitch: 2147483647 bytes
>Texture alignment: 512 bytes
>Concurrent copy and kernel execution: Yes with 1 copy engine(s)
>Run time limit on kernels: No
>Integrated GPU sharing Host Memory: No
>Support host page-locked memory mapping: Yes
>Alignment requirement for Surfaces: Yes
>Device has ECC support: Disabled
>Device supports Unified Addressing (UVA): Yes
>Device PCI Bus ID / PCI location ID: 4 / 0
>Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>
>Device 1: "GeForce GTX TITAN Black"
>CUDA Driver Version / Runtime Version 6.0 / 5.0
>CUDA Capability Major/Minor version number: 3.5
>Total amount of global memory: 6143 MBytes (6441730048 bytes)
>(15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
>GPU Clock rate: 980 MHz (0.98 GHz)
>Memory Clock rate: 3500 MHz
>Memory Bus Width: 384-bit
>L2 Cache Size: 1572864 bytes
>Max Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
>Max Layered Texture Size (dim) x layers: 1D=(16384) x 2048, 2D=(16384,16384) x 2048
>Total amount of constant memory: 65536 bytes
>Total amount of shared memory per block: 49152 bytes
>Total number of registers available per block: 65536
>Warp size: 32
>Maximum number of threads per multiprocessor: 2048
>Maximum number of threads per block: 1024
>Maximum sizes of each dimension of a block: 1024 x 1024 x 64
>Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
>Maximum memory pitch: 2147483647 bytes
>Texture alignment: 512 bytes
>Concurrent copy and kernel execution: Yes with 1 copy engine(s)
>Run time limit on kernels: No
>Integrated GPU sharing Host Memory: No
>Support host page-locked memory mapping: Yes
>Alignment requirement for Surfaces: Yes
>Device has ECC support: Disabled
>Device supports Unified Addressing (UVA): Yes
>Device PCI Bus ID / PCI location ID: 8 / 0
>Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>
>Device 2: "GeForce GTX TITAN Black"
>CUDA Driver Version / Runtime Version 6.0 / 5.0
>CUDA Capability Major/Minor version number: 3.5
>Total amount of global memory: 6143 MBytes (6441730048 bytes)
>(15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
>GPU Clock rate: 980 MHz (0.98 GHz)
>Memory Clock rate: 3500 MHz
>Memory Bus Width: 384-bit
>L2 Cache Size: 1572864 bytes
>Max Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
>Max Layered Texture Size (dim) x layers: 1D=(16384) x 2048, 2D=(16384,16384) x 2048
>Total amount of constant memory: 65536 bytes
>Total amount of shared memory per block: 49152 bytes
>Total number of registers available per block: 65536
>Warp size: 32
>Maximum number of threads per multiprocessor: 2048
>Maximum number of threads per block: 1024
>Maximum sizes of each dimension of a block: 1024 x 1024 x 64
>Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
>Maximum memory pitch: 2147483647 bytes
>Texture alignment: 512 bytes
>Concurrent copy and kernel execution: Yes with 1 copy engine(s)
>Run time limit on kernels: Yes
>Integrated GPU sharing Host Memory: No
>Support host page-locked memory mapping: Yes
>Alignment requirement for Surfaces: Yes
>Device has ECC support: Disabled
>Device supports Unified Addressing (UVA): Yes
>Device PCI Bus ID / PCI location ID: 3 / 0
>Compute Mode:
> < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
>
>deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA
>Runtime Version = 5.0, NumDevs = 3, Device0 = GeForce GTX TITAN Black,
>Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN Black
>
>
>
>________________________________________
>From: Jason Swails [jason.swails.gmail.com]
>Sent: Wednesday, September 10, 2014 12:45 PM
>To: amber.ambermd.org
>Subject: Re: [AMBER] Problem running multiple GPU's
>
>On Wed, 2014-09-10 at 15:53 +0000,
>jon.maguire.louisville.edu wrote:
>We've built a system that has 3 Nvidia Titan Blacks. We CAN run
>pmemd.cuda (and the MPI version) in the following configs:
>
>export CUDA_VISIBLE_DEVICES=0
>export CUDA_VISIBLE_DEVICES=0,1
>export CUDA_VISIBLE_DEVICES=0,2
>
>However, we CANNOT run the following:
>
>export CUDA_VISIBLE_DEVICES=1
>export CUDA_VISIBLE_DEVICES=2
>export CUDA_VISIBLE_DEVICES=1,2
>
>We want to run one job per GPU, but Amber comes back with "Error
>selecting compatible GPU out of memory" when nothing is running on the
>GPU. Or, in the case of running on 1,2, it returns
>"cudaMemcpyToSymbol: SetSim copy to cSim failed out of memory." Is
>there a flag that needs to be set? An nvidia-smi command? It's really
>bizarre behavior!
>
>What happens when you run deviceQuery from the CUDA code samples? Do
>you see all 3 GPUs?
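>
>If the samples are not already built, something like the following
>works with a default CUDA 6.0 samples install (adjust the path for
>your setup):
>
>cd ~/NVIDIA_CUDA-6.0_Samples/1_Utilities/deviceQuery
>make
>./deviceQuery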
>
>It's important to note that the GPU ordering printed by nvidia-smi is
>NOT always the same ordering as what the CUDA runtime sees. In order to
>get the true device ID -> card mapping, you need to use a program that
>actually uses the CUDA API (e.g., deviceQuery).
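>
>One way to line the two up is by PCI bus ID, which both tools report.
>The query form below should work with any reasonably recent driver:
>
>nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
>
>Compare the pci.bus_id column against the "Device PCI Bus ID / PCI
>location ID" lines in the deviceQuery output above.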
>
>Could it be that you have 4 GPUs on your machine, with one powering the
>display, and that the 4th GPU won't work for Amber? In any case, the
>output of deviceQuery will tell us what the CUDA RT expects in terms of
>available GPUs and their properties.
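>
>As an aside, once the mapping is known, the usual way to pin one
>independent job per card is a per-command environment assignment rather
>than a session-wide export (input and output names below are just
>placeholders):
>
>CUDA_VISIBLE_DEVICES=0 $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md0.out -r md0.rst -x md0.nc &
>CUDA_VISIBLE_DEVICES=1 $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md1.out -r md1.rst -x md1.nc &
>CUDA_VISIBLE_DEVICES=2 $AMBERHOME/bin/pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md2.out -r md2.rst -x md2.nc &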
>
>HTH,
>Jason
>
>--
>Jason M. Swails
>BioMaPS,
>Rutgers University
>Postdoctoral Researcher
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 10 2014 - 11:30:02 PDT