Re: [AMBER] benchmarking Tesla S2050 1U system

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 14 Sep 2011 12:41:19 -0700

Hi Sid,

> I have system which contains the Tesla S2050 1U system (
> http://www.nvidia.com/object/preconfigured-clusters.html). The
> hardware
> configuration is different than normally seen. It has a separate box
> which
> contains 4 M2050 GPUs. The interesting thing about this configuration,
> though, is that the box connects to the host CPU by two cables through
> the
> PCIe slots. The main CPU box has two PCIe slots, so that apparently,

These break out boxes are a pain in the butt! Firstly, they split two PCI-E
x16 slots into 4 PCI-E x8 slots, which pretty much destroys any chance of
running well in parallel, as shown by your tests below. If you put just two
GPUs in the breakout box then you may be OK, as long as you put one on a
PCI-E slot connecting to one of the external cables and one on the other. If
they are sharing a cable then you are halving the bandwidth, which is no good
for parallel runs.
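
As a quick sanity check on the bandwidth side you can look at what link
widths the driver actually negotiated. Something like the two commands below
should work on a Linux box; the exact field names vary with lspci and driver
version, and lspci may need to be run as root to show the link status lines.
Bear in mind the link that matters for aggregate bandwidth is the upstream
one back to the host, not just each GPU's own link to the switch inside the
box.

$ lspci -d 10de: -vv | grep -i lnksta
$ nvidia-smi -q | grep -i -A 2 'link width'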

> two
> GPUs are accessible through 1 PCIe slot. This results in the nvidia
> driver
> recognizing only two CUDA devices, as shown by the output of
> deviceQuery:

If deviceQuery shows only 2 GPUs then that is all that are active in your OS,
and so that is all that AMBER will use. You will need to figure out how to
get the driver and deviceQuery to see all 4 GPUs in the node before you can
use them all. Note that the break out box is really 2 independent PCI-E
splits, as follows:

Lead 1 (x16) ----|--- GPU 0 (x8)
                 |--- GPU 1 (x8)

Lead 2 (x16) ----|--- GPU 2 (x8)
                 |--- GPU 3 (x8)

So you need BOTH leads plugged into the node in order to see all 4 GPUs.
There is no internal connection between the two banks of two GPUs. Note also
that the OS may not end up assigning the GPU IDs in a way that matches the
layout in the box; e.g. the GPU shown as GPU 1 above may actually be given
hardware ID 3.
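
To see which physical GPU ended up with which device ID, compare the PCI bus
IDs. Something along these lines should do it (the exact field names depend
on your driver version), and it lets you match the IDs the driver hands out
to the physical layout in the box:

$ nvidia-smi -q | grep -i -E 'product name|bus id'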

> $ deviceQuery -noprompt | egrep '^Device'
> Device 0: "Tesla S2050"
> Device 1: "Tesla S2050"
>
> However, there are actually four GPUs, as stated earlier.

No, there are NOT. There are two GPUs active and that is it. You should see 4
device IDs shown here. Those boxes are totally passive. All they do is
split your two PCI-E x16 slots into 4 PCI-E x8 slots.

You must have cabled things up incorrectly, OR something is wrong with the
hardware, OR the driver installation is messed up in some way. There is
nothing AMBER can do to help you here.

> Now, I am trying to benchmark my system and getting results that are
> not
> ideal. Here is the summary for the DHFR system in the
> $AMBERHOME/benchmarks/dhfr directory (22930 atoms):
>
> 1. Serial: 1 gpu - 23.5 ns/day
> 2. Parallel: 2 gpus (same device 0) - 16.8 ns/day
> 3. Parallel: 2 gpus (same device 1) - 16.7 ns/day
> 4. Parallel: 2 gpus (separate devices) - 27.6 ns/day
> 5. Parallel: 4 gpus (max possible) - 21.3 ns/day

These results are meaningless until you fix things so that all 4 GPUs show
up. What you list as point 2 here is actually both threads running on the
SAME physical GPU, hence the reason it slows down. The same goes for point 3.
Point 4 is good: this is actually running on 2 physical GPUs, although they
only have x8 connection speed, which is why the scaling is so poor. Point 5
is actually 4 threads running on 2 physical GPUs, which is a disaster.
 
> Needless to say, I was disappointed with these results, especially with
> 4
> gpus running in parallel which is slower than simply running a single
> serial
> job. The instructions for running on multiple GPUs state that the
> selection
> of GPUs to be used is automatic, but it appears this is not the case.

It is automatic, but it assumes the OS and hardware are set up and working
correctly.
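
Once all 4 GPUs are visible there is nothing special to add on the command
line; pmemd.cuda.MPI assigns the GPUs to the MPI threads automatically. A
4-way run would look something like this (the input file names here are just
placeholders for whichever benchmark you are running):

$ mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt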

> It
> appears to me that in scenarios #2 and #3 above where I limited the
> available devices using CUDA_VISIBLE_DEVICES to a single device, both
> MPI
> threads were assigned to the same GPU on that device, rather than one
> thread
> assigned per GPU. Is this correct? If so, how would I specify that

Exactly.

> each
> GPU gets a thread? I think this is the same thing that is happening
> with 4
> gpus - both devices are available, but the two threads sent to each
> device
> are assigned to the same GPU, rather than 1 thread per GPU. I would
> appreciate any advice. Thank you.

Make sure everything is cabled up correctly first. That is, you have both of
the cables from the break out box plugged into your node and can see all 4
GPUs in deviceQuery. If you plug both cables in properly and still only see
two GPUs then something is wrong with the hardware; a faulty cable maybe?
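
After re-cabling, re-run deviceQuery and make sure you see 4 entries before
doing anything else:

$ deviceQuery -noprompt | egrep '^Device'

If you then want to check scaling over one GPU per lead before going to all
4, pick two device IDs that you have confirmed sit on different cables (0 and
2 here is just a guess; check the bus IDs as above) and restrict the run to
those:

$ export CUDA_VISIBLE_DEVICES=0,2
$ mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd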

Note that even once this is working you are still going to see limited
performance over the 4 GPUs because of the split PCI-E bandwidth (damn, I
wish they would not make these boxes, or that it was a requirement to have
an engineering degree to actually work in marketing!!!).

Note I have worked with NVIDIA and some of their partners to spec out a
hardware configuration that gets close to optimum performance (given
currently available hardware) within a reasonable price bracket (i.e. not
going all the way down to one GPU per node). This is termed the MDSimCluster
project and I will be updating the AMBER website with details shortly. For
now you can get info here:
http://www.nvidia.com/object/molecular-dynamics-simcluster.html

This has 2 M2090 GPUs per node (M2050s or M2070s would also work) and a
single QDR IB card per node, for up to 4 nodes and hence 8 GPUs. All GPUs are
in x16 slots and the performance should be equivalent to that shown for the
2 GPU per node cases on the following page:
http://ambermd.org/gpus/benchmarks.htm#Benchmarks

I would also suggest downloading the benchmarks from
http://ambermd.org/amber11_bench_files/Amber11_Benchmark_Suite.tar.gz
and using those for testing, since they use settings that are closer to real
production runs.
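
Grabbing and unpacking them is just (assuming you have wget available):

$ wget http://ambermd.org/amber11_bench_files/Amber11_Benchmark_Suite.tar.gz
$ tar xvzf Amber11_Benchmark_Suite.tar.gz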

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 14 2011 - 13:00:03 PDT