Re: [AMBER] Sufficient CPU cores/GPU ratio ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Sun, 18 Sep 2011 01:08:48 +0200

Hi Jodi,

thanks for your thorough answer, including the photos showing the melting in detail!

The problems you described are probably connected with the difference
in power requirements between the C1060 and the C2070.

http://en.wikipedia.org/wiki/Nvidia_Tesla#Specifications_and_configurations


However, I would assume that an insufficient PSU would just cause GPU/CPU
errors, not melting of the PSU connector ... but I am definitely not an
expert here.
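
To make the power comparison concrete, here is a back-of-the-envelope budget in Python. It is only a sketch: every wattage in ASSUMED_W is an assumed round number chosen for illustration (check the specification table linked above for real figures), and the helper names are hypothetical. Only the 1250 W PSU rating comes from Jodi's description.

```python
# Back-of-the-envelope power budget. Every wattage in ASSUMED_W is an
# assumed round number for illustration, not a measured or quoted value.
ASSUMED_W = {
    "C1060": 190,  # assumption: approximate board power of one Tesla C1060
    "C2070": 240,  # assumption: approximate board power of one Tesla C2070
    "cpu": 130,    # assumption: one quad-core Xeon under load
    "other": 100,  # assumption: motherboard, memory, disks, fans
}

PSU_W = 1250  # rating of the PSU named in Jodi's message


def estimated_load_w(gpus):
    """Return the assumed peak draw in watts for a mapping like {"C2070": 4}."""
    gpu_w = sum(ASSUMED_W[model] * count for model, count in gpus.items())
    return gpu_w + ASSUMED_W["cpu"] + ASSUMED_W["other"]


def headroom_w(gpus, psu_w=PSU_W):
    """Return the watts left between the PSU rating and the estimated load."""
    return psu_w - estimated_load_w(gpus)


if __name__ == "__main__":
    configs = {
        "3x C1060": {"C1060": 3},
        "2x C1060 + 2x C2070": {"C1060": 2, "C2070": 2},
        "4x C2070": {"C2070": 4},
    }
    for label, gpus in configs.items():
        print(f"{label:22s} load ~{estimated_load_w(gpus)} W, "
              f"headroom ~{headroom_w(gpus)} W")
```

With these assumed figures the 4x C2070 configuration leaves only about 60 W of headroom on a 1250 W PSU, compared with about 450 W for 3x C1060. Summing nominal ratings tends to overestimate sustained draw, so this is an upper-bound style estimate, not a diagnosis.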

My Swiss colleagues have a professional Tesla workstation (SuperMicro
7046GT-TRF) equipped with 4 x Tesla C2050. The PSU used is:

Redundant 1400W Gold Level (93%)

It worked fine for more than a year, but recently the PSU "died".
Fortunately the whole workstation is under a 3-year warranty :))

BTW, it seems to me that if one would like to build a safe/stable multi-GPU
system (Tesla C2050/C2070 or GTX 580/590) based on a "non-specialised"
motherboard, the upper limit is probably 3 GPUs at the moment.

It would be interesting to know whether the guys from ASUS will send you a
new "P6T7 WS SuperComputer" or choose a different solution, knowing about
the 4x Tesla C2070 requirement.

Anyway, I wish you a lot of fun with your successfully repaired or new
awesome 4x C2070 workstation!


     Best wishes,

        Marek








On Fri, 16 Sep 2011 20:39:03 +0200, Jodi Ann Hadden <jodih.uga.edu> wrote:

> Hi Marek,
>
> Sorry for the delay in replying, our university server had some issues
> and lost a bunch of emails, so I am just now getting this.
>
> In regard to our "melting issue", yes, the melting just happened in the
> socket where the 24-pin from the PSU connects to the motherboard. I can
> send you some pictures off-list.
>
> We got the machine in January 2010, and it only had a single C1060 in it
> then. In the spring of this year, we got 2 additional C1060s. All along,
> the machine was totally solid, ran with no issues. Then over the summer,
> we replaced the 3x C1060s with 4x C2070s. Microway warned us that this
> would bring us close to the limits of our PSU (Cooler Master Real Power
> Pro 1250W), but since we only had 1 CPU they expected it should work.
> The machine actually spent some time as 2x C1060 and 2x C2070 with all
> running at once and did fine. However, when I put all the C2070s in, and
> had them all running at once for a few days, the computer started to
> shut down randomly. We assumed it was that we were indeed beyond the
> limits of our PSU, and we had Microway suggest a more powerful one
> (Thermaltake 1350W). When I went to install the new PSU, that is when I
> discovered that the plastic on the 24-pin connector was melted off of a
> few of the pins into the socket. I wrote Microway again, and they said
> this should not have happened just from using an insufficient PSU, and
> asked that we send the machine in so they could fix it. After the
> machine was there a few weeks, Microway told us ASUS didn't want to
> honor the warranty on the motherboard because the socket burn was
> consistent with using incorrect cables, such as plugging a non-PCI-E
> cable into a PCI-E card. We still had the original power supply and
> could account for where everything was plugged in, and could confirm
> that the machine was indeed cabled correctly. I sent pictures of all
> this to Microway, who requested we then also send them the old PSU so
> they could better work with ASUS. That was a few days ago, and I haven't
> heard back from them yet.
>
> I do find it suspicious that the machine was fine for a year and a half
> and suddenly had this socket burn just when we cranked up 4x C2070s in
> it, but since Microway doesn't seem to think the motherboard, PSU, or 4x
> GPUs together should have caused the problem and agreed to fix the
> machine under warranty, I am still hoping we just had a lemon or
> something.... I hope to get a full explanation for what they think
> caused the damage, and can send you that information as well. I am
> really hoping we don't get the machine back with the same model
> motherboard just to have it burn out again when we run all 4 GPUs... I
> will keep you updated.
>
> Jodi
>
> On Sep 15, 2011, at 2:20 PM, Ross Walker wrote:
>
>>
>>
>> -----Original Message-----
>> From: Marek Maly [mailto:marek.maly.ujep.cz]
>> Sent: Wednesday, September 14, 2011 6:01 PM
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] Sufficient CPU cores/GPU ratio ?
>>
>> Hi Scott,
>> thanks for the warning (but I cannot afford Tesla C2050 GPUs).
>> Fortunately I plan to use the new machine mainly for single-GPU jobs.
>> Of course I will also experiment with parallel GPU runs, but probably
>> with just 2 GPUs per job, where the scaling gain is the biggest and
>> where the instability you mentioned will hopefully be acceptably small.
>> At the moment the melting issue seems much more serious to me, and I am
>> really curious about any additional info from Jodi.
>>
>> Have you ever experienced a similar problem in any of your multi-GPU
>> systems?
>>
>> Best wishes,
>>
>> Marek
>>
>>
>>
>> On Thu, 15 Sep 2011 02:51:33 +0200, Scott Le Grand
>> <varelse2005.gmail.com> wrote:
>>
>>> 3 GB GTX 580s rock for single GPU runs (as many as you can do in a
>>> single system) but are unstable in parallel runs...
>>>
>>> On Wed, Sep 14, 2011 at 5:15 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>>>
>>>> Hello Jodi,
>>>>
>>>> first of all, thanks a lot for sharing your experience with the "P6T7
>>>> WS SuperComputer" + 4 x GPU and for providing your benchmarks! Your
>>>> benchmark results are nice and, as I checked, very similar to those on
>>>> the Amber web site ( http://ambermd.org/gpus/benchmarks.htm ), which is
>>>> a really positive surprise for me considering some of the more
>>>> pessimistic prognoses in this discussion regarding parallel GPU runs
>>>> (> 2 GPUs) on single-socket systems.
>>>>
>>>> Of course, the melting of the power connection is somewhat less
>>>> encouraging news :((
>>>>
>>>> Did I understand correctly that the melting took place only at the end
>>>> which connects to the motherboard?
>>>>
>>>> How long did it work well before this "accident"? During this "OK
>>>> period", were some longer runs (at least a few days) successfully
>>>> completed with all 4 GPUs fully busy?
>>>>
>>>> If it worked well before this problem, what in your opinion caused the
>>>> melting? For example, could it have been some extremely long run on
>>>> all 4 GPUs? Failure of a cooling fan? Do you already have any
>>>> indications from Microway/ASUS regarding this issue?
>>>>
>>>> What is the specification of your PSU?
>>>> (I am thinking about: SilverStone Strider Plus Series SST-ST1500
>>>> 1500W)
>>>>
>>>> BTW, just for curiosity: regarding GPUs, I am now thinking about the
>>>> "MSI N580GTX Lightning Xtreme Edition", which currently seems to be the
>>>> most powerful/precise 3GB GTX580 on the market.
>>>>
>>>> Thanks in advance for any additional comments!
>>>>
>>>> Best wishes,
>>>>
>>>> Marek
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 14 Sep 2011 17:50:21 +0200, Jodi Ann Hadden <jodih.uga.edu>
>>>> wrote:
>>>>
>>>>> Hi Marek,
>>>>>
>>>>> I have a GPU machine with the motherboard you were interested in
>>>>> (ASUS P6T7 WS SuperComputer, single LGA1366 socket with the Intel X58
>>>>> chipset). It has an Intel Xeon W3520 Nehalem 2.66 GHz quad core CPU
>>>>> and 4x NVIDIA Tesla C2070s. Below are some numbers (ns/day) for this
>>>>> machine running a subset of the official AMBER benchmark suite so you
>>>>> can get an idea of the speedup we get when going to all 4 GPUs for a
>>>>> single job.
>>>>>
>>>>> Benchmark                     1xC2070  2xC2070  3xC2070  4xC2070
>>>>> GB/myoglobin                    63.03    77.14    90.69   102.38
>>>>> GB/nucleosome                    1.10     1.34     1.68     1.97
>>>>> GB/TRPCage                     354.40   334.06   330.76   330.75
>>>>> PME/Cellulose_production_NPT     1.97     2.76     3.29     3.56
>>>>> PME/Cellulose_production_NVE     2.19     3.06     3.67     3.96
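
To put numbers on the speedup, the table above can be turned into per-benchmark speedup and parallel efficiency with a few lines of Python. This is only a sketch: the ns/day values are copied from the table, while the function name `scaling` and the definition of efficiency as speedup divided by GPU count are choices made here for illustration.

```python
# ns/day values copied from the benchmark table (1, 2, 3 and 4 GPUs per job).
NS_PER_DAY = {
    "GB/myoglobin": [63.03, 77.14, 90.69, 102.38],
    "GB/nucleosome": [1.10, 1.34, 1.68, 1.97],
    "GB/TRPCage": [354.40, 334.06, 330.76, 330.75],
    "PME/Cellulose_production_NPT": [1.97, 2.76, 3.29, 3.56],
    "PME/Cellulose_production_NVE": [2.19, 3.06, 3.67, 3.96],
}


def scaling(rates):
    """Return (n_gpus, speedup, efficiency) tuples relative to the 1-GPU rate.

    speedup    = rate with n GPUs / rate with 1 GPU
    efficiency = speedup / n (1.0 would be perfect linear scaling)
    """
    base = rates[0]
    return [(n, rate / base, rate / (base * n))
            for n, rate in enumerate(rates, start=1)]


if __name__ == "__main__":
    for name, rates in NS_PER_DAY.items():
        for n, speedup, efficiency in scaling(rates):
            print(f"{name:30s} {n} GPU(s): "
                  f"speedup {speedup:4.2f}x, efficiency {efficiency:4.0%}")
```

For example, GB/myoglobin reaches roughly 1.62x on 4 GPUs, while GB/TRPCage drops to about 0.93x, i.e. it runs slightly slower on 4 GPUs than on one.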
>>>>>
>>>>> As for the issue of cooling, this system is housed in a Lian Li
>>>>> chassis (25x24.9x8.6) with three fans in the front, one in the back,
>>>>> and one on top. I had also noted that the GPUs were getting extremely
>>>>> hot and contacted Microway, the company who assembled the machine for
>>>>> us. They assured me that they'd had experience with 4x Teslas in that
>>>>> chassis, and that cooling was sufficient.
>>>>>
>>>>> I will warn you, however, that we recently experienced a socket burn
>>>>> with this motherboard, where the 24-pin ATX power connection from the
>>>>> PSU to the motherboard had the plastic melt off of some of the pins.
>>>>> Microway/ASUS are replacing it for us under warranty, so hopefully we
>>>>> just had a lemon.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Jodi Hadden
>>>>> University of Georgia
>>>>>
>>>>> On Sep 13, 2011, at 1:17 PM, Marek Maly wrote:
>>>>>
>>>>> OK,
>>>>> thanks again ! If anyone has experience with Amber calculations on
>>>>> a single socket machine with a 4 core CPU equipped with 4 GPUs,
>>>>> please comment.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Marek
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 13 Sep 2011 19:25:20 +0200, Ross Walker
>>>>> <ross.rosswalker.co.uk<mailto:ross.rosswalker.co.uk>> wrote:
>>>>>
>>>>> first of all, thanks a lot for your comprehensive answer!
>>>>> In fact I expect to run mainly independent single-GPU jobs. So if I
>>>>> understood correctly, in that case there should be no problem with the
>>>>> motherboard/CPU/4xGPU combination mentioned below. Am I right?
>>>>>
>>>>> For single GPU runs (i.e. 4 independent jobs) things should be fine,
>>>>> assuming the I/O can keep up etc. The caveat is that I have not
>>>>> actually tried 4 GPUs in a single socket machine with a 4 core CPU, so
>>>>> I am speaking from a theoretical standpoint here, given how the AMBER
>>>>> GPU code works. Someone else who is running such a system might want
>>>>> to chime in with some specific performance numbers if they have them.
>>>>>
>>>>> All the best
>>>>> Ross
>>>>>
>>>>> /\
>>>>> \/
>>>>> |\oss Walker
>>>>>
>>>>> ---------------------------------------------------------
>>>>> | Assistant Research Professor |
>>>>> | San Diego Supercomputer Center |
>>>>> | Adjunct Assistant Professor |
>>>>> | Dept. of Chemistry and Biochemistry |
>>>>> | University of California San Diego |
>>>>> | NVIDIA Fellow |
>>>>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
>>>>> | Tel: +1 858 822 0854 | EMail:-
>>>>> ross.rosswalker.co.uk<mailto:ross.rosswalker.co.uk> |
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Note: Electronic Mail is not secure, has no guarantee of delivery, may
>>>>> not be read every day, and should not be used for urgent or sensitive
>>>>> issues.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org<mailto:AMBER.ambermd.org>
>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>>
>>>>> __________ Information from ESET NOD32 Antivirus, database version 6459
>>>>> (20110913) __________
>>>>>
>>>>> This message was checked by ESET NOD32 Antivirus.
>>>>>
>>>>> http://www.eset.cz
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> This message was created with Opera's revolutionary e-mail client:
>>>>> http://www.opera.com/mail/
>>>>>
>>>>
>>
>


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Sep 17 2011 - 16:30:02 PDT