Re: [AMBER] Wrong results in GTX TITAN, correct results on GTX580

From: Marek Maly <marek.maly.ujep.cz>
Date: Thu, 11 Jul 2013 18:53:25 +0200

Thanks, Scott, and sorry for my
misunderstanding. To be frank, I had never
thought about memory frequency in this context.

So, if I understood correctly, they made just this change:

6008 MHz -> 3004 MHz in memory frequency, while they did not
change the base GPU frequency (837 MHz in non-OC Titan models).

It would be interesting to see the performance effect of such
memory downclocking.
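
For a rough upper bound on what halving the memory clock costs, peak theoretical memory bandwidth can be computed from the effective memory clock and the bus width. The sketch below assumes a 384-bit memory bus, which is the published GTX Titan figure but is not stated anywhere in this thread. It says nothing about how much a particular Amber benchmark would actually slow down, since that depends on how memory-bound the run is.

```python
def memory_bandwidth_gb_s(effective_clock_mhz: float, bus_width_bits: int) -> float:
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return effective_clock_mhz * 1e6 * bytes_per_transfer / 1e9


# Assumption: 384-bit bus, taken from the published Titan spec, not from this thread.
BUS_WIDTH_BITS = 384

stock = memory_bandwidth_gb_s(6008, BUS_WIDTH_BITS)   # memory clock before the mod
halved = memory_bandwidth_gb_s(3004, BUS_WIDTH_BITS)  # memory clock after the mod

print(f"stock:  {stock:.1f} GB/s")
print(f"halved: {halved:.1f} GB/s")
```

Under that assumption the peak bandwidth is simply cut in half, while the 837 MHz core clock, and therefore the arithmetic throughput, is untouched.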

Anyway, thanks for this update. It seems that things are moving, but
perhaps it is still too early for any final conclusions or suggestions.

BTW, when you have some Amber benchmark results from those
modded GPUs in hand, please let us know.
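
One simple way to judge such results is to repeat an identical, deterministic job several times and compare the outputs bit for bit, since occasional wrong values would show up as a mismatch. A minimal sketch, where `run_once` is a hypothetical stand-in for launching the real job (for example an Amber run with a fixed seed) and returning one number such as the final energy, and `fake_job` is only a placeholder so the example runs on its own:

```python
import struct
from typing import Callable, List


def bit_pattern(x: float) -> bytes:
    """Raw IEEE-754 double bytes, so the comparison is exact rather than approximate."""
    return struct.pack("<d", x)


def reproducible(run_once: Callable[[], float], repeats: int = 5) -> bool:
    """Run the same job `repeats` times; True only if all results are bit-identical."""
    results: List[bytes] = [bit_pattern(run_once()) for _ in range(repeats)]
    return all(r == results[0] for r in results)


def fake_job() -> float:
    # Hypothetical placeholder for a real deterministic job.
    return sum(1.0 / (i + 1) for i in range(1000))


if __name__ == "__main__":
    print("bit-identical across runs:", reproducible(fake_job))
```

A check like this only detects non-reproducibility. It cannot say whether the cause is memory, cuFFT, or temperature.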

   Best wishes,

      Marek

On Thu, 11 Jul 2013 18:49:39 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> They downclocked memory, not the GPU. This fixed issues with cuFFT when
> used in combination with a $100 heatsink. One of the AMBER issues on Titan
> is cuFFT occasionally presenting incorrect results. I gave NVIDIA a repro
> app for this. I'd love to see one of these modded GPUs running AMBER.
>
> On Thu, Jul 11, 2013 at 9:16 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Hi Scott,
>>
>> Did I understand correctly that improved cooling or
>> downclocking of the TITANs really solved the problem?
>>
>> If downclocking solved the problem, is it really
>> necessary to downclock by a factor of 2, e.g. to something
>> around 450 MHz? Such "drastic" downclocking should also
>> significantly decrease GPU performance, or am I wrong?
>>
>> BTW, I recently discussed Titan-related things with Tru;
>>
>> see here:
>>
>> http://archive.ambermd.org/201307/0225.html
>> http://archive.ambermd.org/201307/0227.html
>> http://archive.ambermd.org/201307/0229.html
>> http://archive.ambermd.org/201307/0231.html
>>
>>
>> and he wrote that his Tesla K20m runs at 705 or 758 MHz (?) at a
>> temperature of about 40°C (even without a fan).
>>
>> So I assumed that, if downclocking helps here, downclocking to
>> K20 frequencies (about 700 MHz) should be sufficient.
>>
>> Anyway, as you can see in the links above, we mainly discussed the memory
>> overheating hypothesis. I had some objections there, such as the strange
>> selectivity (NEVER any problems with FACTOR IX and always problems with JAC
>> under CUDA 5.0 or 5.5): how can this be explained if the issue is related
>> to memory overheating? The same goes for GB (TRPcage/myoglobin).
>> I also recently ran a longer (about 12 hours) memtestG80 test on one of my
>> Titans. As I already reported, the GPU temperature during this test is the
>> same as for Amber jobs (80°C), and I did not get any errors (5 GB of GPU
>> memory was tested using 5000 memtestG80 iterations).
>>
>> So my conclusion was that the cause of the Titan problem might be a
>> frequency that is too high (for some reason too high for cuFFT to work
>> correctly), but not necessarily overheating itself.
>>
>> But OK, if the enhanced cooling helped here, this is direct proof that
>> overheating really is the main cause. Maybe this high temperature is not a
>> problem for the GPU memory itself (otherwise I would expect some errors
>> during the memtestG80 tests); maybe the overheated memory just causes
>> overheating of some other components, "logic units" (but ones other than
>> those checked by memtestG80), and overheating of these units is critical
>> for some operations (e.g. those done during the FFT calculation)? But this
>> still does not explain the selectivity pointed out above.
>>
>> Anyway, if this overheating theory has now been proved, which target
>> temperature was found to be the "critical" one, i.e. below it the GPU works
>> without errors and above it the problems start?
>>
>> (My guess is that it might be around the K20 working temperature, about
>> 40°C.)
>>
>> Thanks,
>>
>> Best,
>>
>> Marek
>>
>> On Thu, 11 Jul 2013 15:56:12 +0200, Scott Le Grand
>> <varelse2005.gmail.com> wrote:
>>
>> > The problem with memtest is that it just exercises memory. Memory on its
>> > own is usually fine. What seems to be going on is that the memory starts
>> > giving errors when the GPU heats up the system while number-crunching. I
>> > have found all sorts of GPUs, from all AMBER-compatible generations, that
>> > pass memtest and even run n-body, only to blow up running JAC NVE for 2
>> > minutes.
>> >
>> > We know Titan has issues and I suspect at least some 780s will as well.
>> > What's interesting is the people who have gotten around this by modding
>> > their GPU heatsinks and downclocking memory by a factor of 2. Not for the
>> > faint of heart, but a lot cheaper than buying Teslas. That said, I suspect
>> > the future will lead to contexts where Teslas better prove their worth
>> > running AMBER.
>> >
>> > Scott
>> >
>> > On Thu, Jul 11, 2013 at 5:05 AM, ET <sketchfoot.gmail.com> wrote:
>> >
>> >> I never got any memtest errors with my TITANS either, so it is a
>> >> combination of hardware and the CUDA code IMO,
>> >>
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>> --
>> This message was created with Opera's revolutionary e-mail client:
>> http://www.opera.com/mail/
>>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 11 2013 - 10:30:02 PDT