Re: [AMBER] benchmarking Tesla S2050 1U system

From: Sidney Elmer <paulymer.gmail.com>
Date: Thu, 15 Sep 2011 17:08:27 -0700

It turns out that I had a defective cable, so only 2 of the GPUs were
actually connected to the CPU host box. Now that I have 4 GPU devices
available, I redid my benchmarks using the same DHFR system as before (I
will use the new benchmarks in future studies). Here are the results:


   - 1 GPU: 22.4 ns/day
   - 2 GPU: 26.7 ns/day
   - 4 GPU: 35.0 ns/day

Still not great scaling (roughly 1.2x on 2 GPUs and 1.6x on 4, relative to a
single GPU), but that is not surprising given that the hardware configuration
uses x8 PCIe connections rather than x16. Thanks for your help.

Sid



On Wed, Sep 14, 2011 at 3:55 PM, Sidney Elmer <paulymer.gmail.com> wrote:

> Hi Ross,
>
> Thank you for your very detailed answer; it is very helpful. I agree with
> everything you said, and it all makes sense. It turns out that the
> Installation Guide for the system was confusing about how to verify that
> the devices are recognized. I was under the impression that I would see
> only 2 3D controllers with lspci (which is all I see) and that deviceQuery
> would list only two devices. Since that is all I saw, I thought that
> everything had been installed and configured correctly. It didn't make
> sense to me, but that is what the manual said, so I accepted it at the
> time. The results demonstrate that this is not correct, and I will try to
> fix it so that it looks more like what you described. Thanks for the
> clarification. Best.
>
> Sid
>
>
> On Wed, Sep 14, 2011 at 12:41 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Sid,
>>
>> > I have a system which contains the Tesla S2050 1U system (
>> > http://www.nvidia.com/object/preconfigured-clusters.html). The hardware
>> > configuration is different from what is normally seen. It has a
>> > separate box which contains 4 M2050 GPUs. The interesting thing about
>> > this configuration, though, is that the box connects to the host CPU by
>> > two cables through the PCIe slots. The main CPU box has two PCIe slots,
>> > so that apparently,
>>
>> These breakout boxes are a pain in the butt! Firstly, they split two
>> PCI-E x16 slots into 4 PCI-E x8 slots, which pretty much destroys any
>> chance of running well in parallel, as shown by your tests below. If you
>> put just two GPUs in the breakout box then you may be OK, as long as you
>> put one on a PCI-E slot connecting to one of the external cables and one
>> on the other. If they are sharing a cable then you are halving the
>> bandwidth, which is no good for parallel runs.
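>>
>> (One way to sanity-check this, assuming you have the CUDA SDK samples
>> built, is to run bandwidthTest against each device in turn, e.g.
>>
>> $ bandwidthTest --device=0
>> $ bandwidthTest --device=1
>>
>> and compare the host-to-device numbers; two GPUs sharing a lead should
>> each show roughly half the bandwidth of a GPU with a lead to itself.)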
>>
>> > two GPUs are accessible through 1 PCIe slot. This results in the nvidia
>> > driver recognizing only two CUDA devices, as shown by the output of
>> > deviceQuery:
>>
>> If deviceQuery shows only 2 GPUs then this is all that are active in your
>> OS, so this is all that AMBER will use. You will need to figure out how
>> to get the driver and deviceQuery to see all 4 GPUs in one node before
>> you can use them all in said node. Note the breakout box is really 2
>> independent PCI-E splits, as follows:
>>
>>               |--- GPU 0 (x8)
>> Lead 1 (x16) -|
>>               |--- GPU 1 (x8)
>>
>>               |--- GPU 2 (x8)
>> Lead 2 (x16) -|
>>               |--- GPU 3 (x8)
>>
>> So you need BOTH leads plugged into the node in order to see all 4 GPUs.
>> There is no internal connection between the two banks of two GPUs. Note
>> also that the OS may not actually end up assigning the GPU IDs in a way
>> that matches the layout in the box, i.e. the GPU shown as GPU 1 above may
>> actually be given hardware ID 3, for example.
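>>
>> (If you need to map logical IDs onto physical boards, the PCI bus ID
>> reported by deviceQuery, or by nvidia-smi if your driver provides it, is
>> a more reliable guide than the enumeration order.)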
>>
>> > $ deviceQuery -noprompt | egrep '^Device'
>> > Device 0: "Tesla S2050"
>> > Device 1: "Tesla S2050"
>> >
>> > However, there are actually four GPUs, as stated earlier.
>>
>> No there are NOT. There are two GPUs active and that is it. You should
>> see 4 device IDs shown here. Those boxes are totally passive. All they do
>> is split your two PCI-E x16 slots into 4 PCI-E x8 slots.
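>>
>> With both leads connected and the driver seeing everything, you should
>> instead get something along these lines (exact device strings may vary):
>>
>> $ deviceQuery -noprompt | egrep '^Device'
>> Device 0: "Tesla S2050"
>> Device 1: "Tesla S2050"
>> Device 2: "Tesla S2050"
>> Device 3: "Tesla S2050"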
>>
>> You must have cabled things up incorrectly, OR something is wrong with
>> the hardware, OR the driver installation is messed up in some way. There
>> is nothing AMBER can do to help you here.
>>
>> > Now, I am trying to benchmark my system and getting results that are
>> > not ideal. Here is the summary for the DHFR system in the
>> > $AMBERHOME/benchmarks/dhfr directory (22930 atoms):
>> >
>> > 1. Serial: 1 gpu - 23.5 ns/day
>> > 2. Parallel: 2 gpus (same device 0) - 16.8 ns/day
>> > 3. Parallel: 2 gpus (same device 1) - 16.7 ns/day
>> > 4. Parallel: 2 gpus (separate devices) - 27.6 ns/day
>> > 5. Parallel: 4 gpus (max possible) - 21.3 ns/day
>>
>> These results are meaningless until you fix things to show all 4 GPUs.
>> What you list as point 2 here is actually both threads running on the
>> SAME physical GPU, hence the reason it slows down. Same goes for point 3.
>> Point 4 is good; this is actually running on 2 physical GPUs, although
>> they have only x8 connection speed, which is why the scaling is so poor.
>> Point 5 is actually 4 threads running on 2 physical GPUs, which is a
>> disaster.
>>
>> > Needless to say, I was disappointed with these results, especially with
>> > 4 gpus running in parallel, which is slower than simply running a
>> > single serial job. The instructions for running on multiple GPUs state
>> > that the selection of GPUs to be used is automatic, but it appears this
>> > is not the case.
>>
>> It is automatic, but it assumes the OS and hardware are set up and
>> working correctly.
>>
>> > It appears to me that in scenarios #2 and #3 above, where I limited the
>> > available devices using CUDA_VISIBLE_DEVICES to a single device, both
>> > MPI threads were assigned to the same GPU on that device, rather than
>> > one thread assigned per GPU. Is this correct? If so, how would I
>> > specify that
>>
>> Exactly.
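>>
>> For illustration (a sketch; the input file names here are placeholders),
>> once all four GPUs show up in deviceQuery you would expose one device per
>> MPI thread, e.g.:
>>
>> export CUDA_VISIBLE_DEVICES=0,1,2,3
>> mpirun -np 4 $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
>>
>> Restricting CUDA_VISIBLE_DEVICES to a single ID makes every thread land
>> on that one GPU, which is exactly what your runs 2 and 3 show.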
>>
>> > each GPU gets a thread? I think this is the same thing that is
>> > happening with 4 gpus - both devices are available, but the two threads
>> > sent to each device are assigned to the same GPU, rather than 1 thread
>> > per GPU. I would appreciate any advice. Thank you.
>>
>> Make sure everything is cabled up correctly first. That is, make sure you
>> have both of the cables from the breakout box plugged into your node and
>> can see all 4 GPUs in deviceQuery. If you plug both cables in properly
>> and still only see two GPUs then something is wrong with the hardware (a
>> faulty cable, maybe?).
>>
>> Note that even once this is working you are still going to have limited
>> performance over the 4 GPUs due to splitting the PCI-E bandwidth (damn, I
>> wish they would not make these boxes, or that it was a requirement to
>> have an engineering degree to actually work in marketing!!!).
>>
>> Note I have worked with NVIDIA and some of their partners to attempt to
>> spec up a hardware configuration that represents an almost optimum
>> performance configuration (given currently available hardware) within a
>> reasonable price bracket (i.e. not going all the way down to one GPU per
>> node). This is termed the MDSimCluster project and I will be updating the
>> AMBER website with details shortly. For now you can get info here:
>> http://www.nvidia.com/object/molecular-dynamics-simcluster.html
>>
>> This has 2 M2090 GPUs per node (M2050s or M2070s would also work) and a
>> single QDR IB card per node, for up to 4 nodes and hence 8 GPUs. All are
>> in x16 slots, and the performance should be equivalent to that shown on
>> the following page for the 2 GPU per node cases:
>> http://ambermd.org/gpus/benchmarks.htm#Benchmarks
>>
>> I would also suggest downloading the benchmarks from this page:
>> http://ambermd.org/amber11_bench_files/Amber11_Benchmark_Suite.tar.gz
>> and using those for testing, since they use settings more optimal for
>> production runs.
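>>
>> For example:
>>
>> $ wget http://ambermd.org/amber11_bench_files/Amber11_Benchmark_Suite.tar.gz
>> $ tar xzf Amber11_Benchmark_Suite.tar.gz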
>>
>> All the best
>> Ross
>>
>> /\
>> \/
>> |\oss Walker
>>
>> ---------------------------------------------------------
>> | Assistant Research Professor |
>> | San Diego Supercomputer Center |
>> | Adjunct Assistant Professor |
>> | Dept. of Chemistry and Biochemistry |
>> | University of California San Diego |
>> | NVIDIA Fellow |
>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> ---------------------------------------------------------
>>
>> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>> be read every day, and should not be used for urgent or sensitive issues.
>>
>>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Sep 15 2011 - 17:30:03 PDT