[AMBER] benchmarking Tesla S2050 1U system

From: Sidney Elmer <paulymer.gmail.com>
Date: Wed, 14 Sep 2011 11:52:38 -0700

Hi,

I have a system which contains the Tesla S2050 1U system (
http://www.nvidia.com/object/preconfigured-clusters.html). The hardware
configuration is different from what is normally seen: it has a separate box
which contains 4 M2050 GPUs. The interesting thing about this configuration,
though, is that the box connects to the host CPU by two cables through the
PCIe slots. The main CPU box has two PCIe slots, so apparently two GPUs are
accessible through one PCIe slot. This results in the NVIDIA driver
recognizing only two CUDA devices, as shown by the output of deviceQuery:

$ deviceQuery -noprompt | egrep '^Device'
Device 0: "Tesla S2050"
Device 1: "Tesla S2050"


However, there are actually four GPUs, as stated earlier.
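
As a sanity check, here is a small standalone program (not part of Amber,
just the plain CUDA runtime API) that asks the runtime how many devices it
exposes and prints their names, to confirm whether it really sees two
devices or four:

/* devcount.cu - standalone device-enumeration check.
 * Build with: nvcc devcount.cu -o devcount */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA runtime reports %d device(s)\n", count);
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: \"%s\"\n", i, prop.name);
    }
    return 0;
}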

I have compiled and installed the CUDA-enabled programs on my system, both
the serial and the parallel versions, with the latest drivers and compilers:

CPU=4x 8-core Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
GPU=Tesla S2050 1U system
mpich2 1.2.1p1
gfortran 4.1.2
nvcc v4.0
NVIDIA Driver Linux 64 - v270.41.34

I am also using the latest version of AmberTools 1.5 with all the latest
bug fixes (1 through 17) applied. All tests for the AmberTools and Amber11
programs passed, including serial, parallel, serial.cuda, and parallel.cuda.

Now I am trying to benchmark my system, and I am getting results that are
not ideal. Here is the summary for the DHFR system in the
$AMBERHOME/benchmarks/dhfr directory (22930 atoms):

   1. Serial: 1 GPU - 23.5 ns/day
   2. Parallel: 2 GPUs (same device 0) - 16.8 ns/day
   3. Parallel: 2 GPUs (same device 1) - 16.7 ns/day
   4. Parallel: 2 GPUs (separate devices) - 27.6 ns/day
   5. Parallel: 4 GPUs (max possible) - 21.3 ns/day

Needless to say, I was disappointed with these results, especially the 4-GPU
parallel run, which is slower than simply running a single serial job. The
instructions for running on multiple GPUs state that the selection of GPUs
is automatic, but that does not appear to be the case here. In scenarios #2
and #3 above, where I limited the available devices to a single device with
CUDA_VISIBLE_DEVICES, it looks like both MPI threads were assigned to the
same GPU on that device rather than one thread per GPU. Is this correct? If
so, how would I specify that each GPU gets its own thread? I suspect the
same thing is happening with 4 GPUs: both devices are available, but the two
threads sent to each device end up on the same GPU rather than one thread
per GPU. A small sketch of the one-rank-per-GPU mapping I expected follows
below. I would appreciate any advice. Thank you.
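
To illustrate that expectation, here is a rough standalone sketch (not
Amber's actual device-selection code, just the plain MPI and CUDA runtime
APIs) of a round-robin, one-rank-per-GPU assignment; the file name and the
mapping itself are only my illustration:

/* rankdev.c - illustrative only: give each MPI rank its own CUDA device.
 * Build (paths will vary), e.g.:
 *   mpicc rankdev.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
 */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) {
        fprintf(stderr, "rank %d: no CUDA devices visible\n", rank);
        MPI_Finalize();
        return 1;
    }

    /* Round-robin mapping: rank 0 -> device 0, rank 1 -> device 1, ... */
    int dev = rank % ndev;
    cudaSetDevice(dev);

    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("MPI rank %d using CUDA device %d (\"%s\")\n", rank, dev, prop.name);

    MPI_Finalize();
    return 0;
}

Run under mpirun with as many ranks as visible GPUs; if two ranks report the
same device number, they are sharing a GPU, which is what I suspect is
happening in runs #2, #3, and #5 above.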

Sid
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 14 2011 - 12:00:04 PDT