Re: [AMBER] Wrong results in GTX TITAN, correct results on GTX580

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 11 Jul 2013 09:49:39 -0700

They downclocked memory, not the GPU. This fixed issues with cuFFT when
used in combination with a $100 heatsink. One of the AMBER issues on Titan
is cuFFT occasionally presenting incorrect results. I gave NVIDIA a repro
app for this. I'd love to see one of these modded GPUs running AMBER.
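
A rough illustration of the kind of check that a memtest-style test does not perform: feed the same input to the transform many times and count the runs whose output differs from the first. This is only a sketch, not the repro app mentioned above. `naive_dft` and `count_mismatches` are hypothetical names, and the pure-Python DFT is merely a stand-in for a cuFFT call, on the assumption that a healthy card returns identical output for identical input.

```python
import cmath


def naive_dft(samples):
    """Reference O(n^2) discrete Fourier transform (stand-in for a GPU FFT)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def count_mismatches(transform, samples, repeats):
    """Run `transform` on the same input `repeats` times and count the runs
    whose output is not exactly equal to the output of the first run."""
    reference = transform(samples)
    mismatches = 0
    for _ in range(repeats - 1):
        if transform(samples) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Hypothetical test signal; any fixed input works.
    signal = [complex(i % 7, 0) for i in range(64)]
    print(count_mismatches(naive_dft, signal, repeats=100))
```

With the deterministic stand-in the count is 0. Swapping in a wrapper around the real GPU transform, and running it while the card is hot, is where a non-zero count would become interesting.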

On Thu, Jul 11, 2013 at 9:16 AM, Marek Maly <marek.maly.ujep.cz> wrote:

> Hi Scott,
>
> did I understand correctly that improved cooling or
> downclocking of the TITANs really solved the problem?
>
> If downclocking solved the problem, is it really
> necessary to downclock by a factor of 2, e.g. to something
> around 450 MHz? Such "drastic" downclocking should also
> significantly decrease GPU performance, or am I wrong?
>
> BTW, I recently discussed Titan-related things with Tru,
>
> see here:
>
> http://archive.ambermd.org/201307/0225.html
> http://archive.ambermd.org/201307/0227.html
> http://archive.ambermd.org/201307/0229.html
> http://archive.ambermd.org/201307/0231.html
>
>
> and he wrote that his Tesla K20m runs at 705 or 758 MHz (?) at a
> temperature of about 40°C (even without a fan).
>
> So I assumed that if downclocking helps here, downclocking to
> K20 frequencies (about 700 MHz) should be sufficient.
>
> Anyway, as you can see in the links above, we mainly discussed the memory
> overheating hypothesis. I had some objections there, such as the strange
> selectivity (NEVER any problems with FACTOR IX and always problems with
> JAC under CUDA 5.0 or 5.5): how can this be explained if the issue is
> related to memory overheating? The same goes for GB (TRPcage/myoglobin).
> I also recently ran a longer (about 12 hours) memtestG80 test on one of
> my Titans. As I already reported, the GPU temperature during this test
> is the same as for Amber jobs (80°C), and I did not get any errors
> (5 GB of GPU memory was tested using 5000 memtestG80 iterations).
>
> So my conclusion was that the cause of the Titan problem might be too high
> a frequency (for some reason too high for cuFFT to work correctly), but not
> necessarily overheating itself.
>
> But OK, if the enhanced cooling helped here, this is direct proof that
> overheating really is the main cause. Maybe this high temperature is not a
> problem for the GPU memory itself (otherwise I would expect some errors
> during the memtestG80 tests); maybe the overheated memory just causes
> some other components ("logic units", other than those checked by
> memtestG80) to overheat, and overheating of these units is critical for
> some operations (e.g. those done during the FFT calculation)? But this
> still does not explain the selectivity pointed out above.
>
> Anyway, if this overheating theory has now been proven, which target
> temperature was found to be the "critical" one, i.e. below it the GPU works
> without errors and above it the problems start?
>
> (My guess is that it might be around the K20 working temperature, about 40°C.)
>
> Thanks,
>
> Best,
>
> Marek
>
> On Thu, 11 Jul 2013 15:56:12 +0200 Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
> > The problem with memtest is that it just exercises memory. Memory on its
> > own is usually fine. What seems to be going on is that the memory starts
> > giving errors when the GPU heats up the system while number-crunching.
> > I have found all sorts of GPUs, from all AMBER-compatible generations,
> > that pass memtest and even run n-body, only to blow up running JAC NVE
> > for 2 minutes.
> >
> > We know Titan has issues and I suspect at least some 780s will as well.
> > What's interesting is the people who have gotten around this by modding
> > their GPU heatsinks and downclocking memory by a factor of 2. Not for
> > the faint of heart, but a lot cheaper than buying Teslas. That said, I
> > suspect the future will lead to contexts where Teslas better prove their
> > worth running AMBER.
> >
> > Scott
> >
> >
> >
> >
> > On Thu, Jul 11, 2013 at 5:05 AM, ET <sketchfoot.gmail.com> wrote:
> >
> >> I never got any memtest errors with my TITANS either, so it is a
> >> combination of hardware and the CUDA code IMO,
> >>
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 11 2013 - 10:00:05 PDT