[AMBER] Horrific pmemd.cuda performance on an 8-GPU system from Sasha Buzko on 2010-08-11 (Amber Archive Aug 2010)

From: Sasha Buzko <obuzko.ucla.edu>
Date: Wed, 11 Aug 2010 12:15:59 -0700

Hi guys,
We just got an 8-GPU system that uses onboard PLX switches to expand
four x16 PCIe slots on a Tyan board to eight. I ran a small production
simulation test, and the results are horrifying, to say the least.
If you could help narrow down possible sources of the problem, it would
be great.

The system in question is made by Colfax (CXT-8000,
www.colfax-intl.com/nvidiaGPU.html). It has GTX480 cards, 12 GB of RAM,
and two 2.5" 7200 rpm 320 GB hard drives. The CPUs are two quad-core
Xeon 5550s.

The test simulation was run on a single GPU and ended up more than 5
times slower than the identical simulation on a reference system. The
20000-step run took 1.07 hours on the Colfax system, but only 0.2 hours
on the reference desktop.
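
For reference, the slowdown can be worked out from the two wall-clock
times above. This is only a minimal Python sketch of the arithmetic; the
function and variable names are illustrative, and the inputs are just the
rounded hours quoted in this message.

```python
# Wall-clock times and step count quoted in the message above.
COLFAX_HOURS = 1.07
DELL_HOURS = 0.2
STEPS = 20000


def slowdown(test_hours, reference_hours):
    """Return how many times slower the test run is than the reference."""
    return test_hours / reference_hours


def seconds_per_step(hours, steps):
    """Convert a total wall-clock time in hours to seconds per MD step."""
    return hours * 3600.0 / steps


print(f"slowdown: {slowdown(COLFAX_HOURS, DELL_HOURS):.2f}x")
print(f"Colfax: {seconds_per_step(COLFAX_HOURS, STEPS):.4f} s/step")
print(f"Dell:   {seconds_per_step(DELL_HOURS, STEPS):.4f} s/step")
```

With these inputs the ratio comes out at about 5.35, consistent with
"more than 5 times slower".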

The reference system is a Dell workstation with an identical GTX480 and
the same CUDA driver. I'm giving below the bandwidth and deviceQuery
output for both systems.

Could it be a BIOS problem, an issue with the PLX switching, or a slow
hard drive? I've tried several tests from the NVIDIA test suite that came
with ./deviceQuery, and the performance numbers were similar to those
from the reference system. Right now I'm trying to get hold of some
non-Amber code that runs for longer.
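
One way to check the slow hard drive idea independently of Amber is to
time a plain sequential write in the directory where the simulation
writes its output. The following is a rough Python sketch only; the
function name, file size, and chunk size are arbitrary choices, not
anything taken from Amber or the NVIDIA tools, and a single sequential
write is only a crude stand-in for real trajectory I/O.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, chunk_mb=4):
    """Write about size_mb of zeros to a temp file in `directory`,
    force it to disk with fsync, and return the rate in MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    n_chunks = size_mb // chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # include the time to reach the disk
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return (n_chunks * chunk_mb) / elapsed
```

Calling write_throughput_mb_s() with the run directory on both machines
gives two numbers to compare. If they are similar, the disk is less
likely to explain a 5x difference.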

Any suggestions would be very much appreciated.

Thanks

Sasha

############# diagnostic output ###############

Test system (Colfax) with 6 cards:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 6 devices supporting CUDA

Device 0: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Device 2: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Device 3: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Device 4: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Device 5: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1610285056 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA
Runtime Version = 3.0, NumDevs = 6, Device = GeForce GTX 480, Device =
GeForce GTX 480

PASSED

[bandwidthTest]
./bandwidthTest Starting...

Running on...

Device 0: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 3306.6

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 3086.5

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 111378.4

[bandwidthTest] - Test results:
PASSED

Reference system (Dell):

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 480"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 3.0
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1609760768 bytes
  Number of multiprocessors: 15
  Number of cores: 480
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.40 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4243455, CUDA
Runtime Version = 3.0, NumDevs = 1, Device = GeForce GTX 480

PASSED

[bandwidthTest]
./bandwidthTest Starting...

Running on...

Device 0: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 1982.9

Device to Host Bandwidth, 1 Device(s), Paged memory
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 1623.1

Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432 74032.1

[bandwidthTest] - Test results:
PASSED
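
To put the two bandwidthTest runs side by side, the ratios can be
computed directly from the figures above. This is a small Python sketch
with illustrative names; it only restates the quick-mode numbers already
pasted here and says nothing about sustained or pinned-memory transfers.

```python
# Quick-mode bandwidthTest results (MB/s) copied from the output above.
COLFAX = {
    "host_to_device": 3306.6,
    "device_to_host": 3086.5,
    "device_to_device": 111378.4,
}
DELL = {
    "host_to_device": 1982.9,
    "device_to_host": 1623.1,
    "device_to_device": 74032.1,
}


def bandwidth_ratios(test, reference):
    """Return test/reference bandwidth for each transfer direction."""
    return {name: test[name] / reference[name] for name in test}


for name, ratio in bandwidth_ratios(COLFAX, DELL).items():
    print(f"{name}: Colfax/Dell = {ratio:.2f}")
```

All three ratios are above 1, so on these numbers the Colfax box is not
slower than the Dell at simple memory transfers.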

Ross Walker wrote:
>> In theory, then, we could even take a board with 3+ PCI-E x16 slots and
>> max them out with DHICs (as long as we can maintain the corresponding
>> processor core count)...
>>
>
> Yes... Although don't forget to add enough memory for the CPUs as well.
> Probably 1 to 1.5GB of main memory per GPU would be optimal.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 11 2010 - 12:30:04 PDT