I was able to find the root of the problem. Here is one thing to check if you are running a multi-GPU box at runlevel 5 (GUI).

Make sure the options for SLI and MultiGPU are set to “off” within the xorg.conf file.

If those are on, everything depends on device 0 to run. Everything is back to normal now; thank you all for the help.
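
A minimal sketch of how the check above could be automated, assuming the options appear in xorg.conf as lines of the form Option "SLI" "..." and Option "MultiGPU" "...". The function names are made up for illustration, and only the literal value "off" mentioned above is treated as disabled; other spellings the driver may accept are not handled.

```python
import re

# Matches e.g.:  Option "SLI" "Auto"   or   Option "MultiGPU" "Off"
# The value part is optional because an option may be written without one.
OPTION_RE = re.compile(
    r'^\s*Option\s+"(SLI|MultiGPU)"(?:\s+"([^"]*)")?', re.IGNORECASE
)


def find_multi_gpu_options(xorg_text):
    """Return (option_name, value) for every SLI/MultiGPU option line."""
    found = []
    for line in xorg_text.splitlines():
        # Lines starting with '#' are comments in xorg.conf; skip them.
        if line.lstrip().startswith("#"):
            continue
        match = OPTION_RE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def enabled_multi_gpu_options(xorg_text):
    """Return the options that are NOT explicitly set to "off"."""
    return [
        (name, value)
        for name, value in find_multi_gpu_options(xorg_text)
        if (value or "").lower() != "off"
    ]
```

To use it, read your own xorg.conf (commonly /etc/X11/xorg.conf, but check your system) into a string and pass it to enabled_multi_gpu_options(); an empty list means neither option was found in a state other than "off".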


Hi Jon,

Weird indeed - I would concentrate on single GPU for now and get that
working properly before trying multi-GPU. But let's try the following:

First please download the following:

untar it and then edit the file

Set the following at the top of the file:
#------ SET FOR YOUR SYSTEM -------

Then run the following:

init 3
./ >& run_bench_CPU+GPU.log &
tail -f run_bench_CPU+GPU.log

Once it completes let me know what the log file says.

Note for parallel runs with 3 GPUs in your system (I am assuming it is dual
socket here?): you will likely have 2 GPUs on one socket and 1 on the other.
You will only be able to run in parallel, with peer to peer, across the
two GPUs that are connected to the same CPU socket. Trying other
combinations will disable peer to peer and you will end up with
performance slower than a single GPU run. Take a look at the following
for details on how to test peer to peer support within your system.
If your machine is only single socket, and without an 8780 PCI-E switch
(I know of no motherboards shipping with this onboard right now), you will
likely have one or more of your GPUs demoted to x8 PCI-E performance,
which rules out any speedup for parallel PME runs (GB might still scale).
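
As a rough illustration of the "same socket" rule above, the sketch below groups GPUs by the NUMA node that Linux reports in sysfs (the numa_node file under /sys/bus/pci/devices/<address>/) and lists the pairs that share a node. The function names and any PCI addresses you pass in are illustrative assumptions. Sharing a node is only a hint, not proof, that peer to peer will work; the authoritative answer comes from the CUDA runtime itself (cudaDeviceCanAccessPeer).

```python
import os
from collections import defaultdict
from itertools import combinations


def numa_node_of(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Read the NUMA node Linux reports for one PCI device (-1 if unknown)."""
    path = os.path.join(sysfs_root, pci_address, "numa_node")
    with open(path) as handle:
        return int(handle.read().strip())


def same_socket_pairs(pci_addresses, sysfs_root="/sys/bus/pci/devices"):
    """Return pairs of PCI addresses that report the same NUMA node."""
    by_node = defaultdict(list)
    for address in pci_addresses:
        by_node[numa_node_of(address, sysfs_root)].append(address)

    pairs = []
    for node, members in sorted(by_node.items()):
        if node < 0:
            # The kernel could not determine the node, so draw no conclusion.
            continue
        pairs.extend(combinations(members, 2))
    return pairs
```

Pass in the full PCI addresses of your GPUs (as listed by lspci -D). On a dual-socket box with 2 + 1 GPUs you would expect a single pair back; on many single-socket machines the node is reported as -1 and the list will be empty.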

All the best

One of the three cards is powering a display, although it doesn't really
need to. Regardless, here is the output from deviceQuery

Detected 3 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN Black"
CUDA Driver Version / Runtime Version 6.0 / 5.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6143 MBytes (6441730048
(15) Multiprocessors x (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Clock rate: 980 MHz (0.98 GHz)
Memory Clock rate: 3500 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536),
2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048,
2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
   < Default (multiple host threads can use ::cudaSetDevice() with
device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA
Runtime Version = 5.0, NumDevs = 3, Device0 = GeForce GTX TITAN Black,
Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN Black

We've built a system that has 3 Nvidia Titan Blacks. We CAN run
pmemd.cuda (and the MPI version) in the following configs


However, we CANNOT run the following:


We want to run one job per GPU, but Amber comes back with "Error
selecting compatible GPU out of memory" when nothing is running on the
GPU. Or, in the case of running on 1,2, it returns
"cudaMemcpyToSymbol: SetSim copy to cSim failed out of memory." Is
there a flag that needs to be set? An nvidia-smi command? It's really
bizarre behavior!
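
For the one-job-per-GPU goal described above, a common way to pin a process to a single card is the CUDA_VISIBLE_DEVICES environment variable, which the CUDA runtime reads at start-up. The sketch below is only an illustration of that mechanism, with made-up helper names; it does not by itself explain or fix the out-of-memory errors quoted above.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Copy an environment and restrict CUDA to a single device index."""
    env = dict(os.environ if base_env is None else base_env)
    # The process started with this environment sees only the listed device,
    # and the CUDA runtime renumbers it as device 0 inside that process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(command, gpu_index):
    """Start `command` (a list of strings) pinned to one GPU.

    Returns the subprocess.Popen handle so the caller can wait on it.
    """
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))
```

With cmd set to your own pmemd.cuda command line (as a list of strings), something like jobs = [launch_on_gpu(cmd, i) for i in range(3)] would start one job per card. The indices are the ones the CUDA runtime uses, which may differ from the order nvidia-smi prints.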

What happens when you run deviceQuery from the CUDA code samples? Do
you see all 3 GPUs?

It's important to note that the GPU ordering printed by nvidia-smi is
NOT always the same ordering as what the CUDA runtime sees. In order to
get the true device ID -> card mapping, you need to use a program that
actually uses the CUDA API (e.g., deviceQuery).
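
To make the device ID -> card mapping concrete, here is a small sketch that pulls the CUDA device index, name and PCI bus ID out of saved deviceQuery output. The function name is an assumption for illustration, and the regular expressions assume the line format quoted in this thread ('Device N: "name"' and 'Device PCI Bus ID / PCI location ID: B / L'); other deviceQuery versions may print these lines differently.

```python
import re

DEVICE_RE = re.compile(r'^Device (\d+): "(.+)"')
BUS_RE = re.compile(
    r"Device PCI Bus ID / PCI location ID:\s*(\d+)\s*/\s*(\d+)"
)


def parse_device_query(text):
    """Return one dict per device found in deviceQuery output."""
    devices = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = DEVICE_RE.match(line)
        if header:
            # A 'Device N: "name"' line starts a new device section.
            current = {
                "index": int(header.group(1)),
                "name": header.group(2),
                "pci_bus": None,
                "pci_location": None,
            }
            devices.append(current)
            continue
        bus = BUS_RE.search(line)
        if bus and current is not None:
            # Attach the PCI IDs to the most recently seen device.
            current["pci_bus"] = int(bus.group(1))
            current["pci_location"] = int(bus.group(2))
    return devices
```

Run deviceQuery, redirect its output to a file, and pass the file contents to parse_device_query(); the pci_bus values can then be compared against the bus IDs that nvidia-smi or lspci report for each physical card.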

It could be that you have 4 GPUs on your machine with one powering the
display? And that 4th GPU won't work for Amber? In any case, the
output of deviceQuery will tell us what the CUDA RT expects in terms of
available GPUs and their properties.


