Re: [AMBER] Wrong results in GTX TITAN, correct results on GTX580

From: Marek Maly <>
Date: Thu, 11 Jul 2013 18:16:49 +0200

Hi Scott,

did I understand correctly that improved cooling or
downclocking of the TITANs really solved the problem?

If downclocking solved the problem, is it really
necessary to downclock by a factor of 2, e.g. to something
around 450 MHz? Such "drastic" downclocking should also
significantly decrease GPU performance, or am I wrong?

BTW I discussed Titan related things recently with Tru

see here:

and he wrote that his Tesla K20m runs at 705 or 758 MHz (?) at a
temperature of about 40°C (even without a fan).

So I assume that, if downclocking helps here, downclocking to
K20 frequencies (about 700 MHz) should be sufficient.

Anyway, as you can see in the links above, we mainly discussed the memory
overheating hypothesis. I had some objections there, such as the strange
selectivity (NEVER any problems with FACTOR IX and always problems with
JAC under CUDA 5.0 or 5.5): how can this be explained if the issue is
related to memory overheating? The same applies to GB.

I also recently ran a longer (about 12 hours) memtestG80 test on one of
my Titans. As I already reported, the GPU temperature during this test is
the same as for Amber jobs, i.e. 80°C, and I did not get any errors
(5 GB of GPU memory was tested using 5000 memtestG80 iterations).

So my conclusion was that the cause of the Titan problem might be a
frequency that is too high (for some reason too high for cuFFT to work
correctly), but not necessarily overheating itself.

But OK, if the enhanced cooling helped here, this is direct proof that
overheating is really the main cause. Maybe this high temperature is not
a problem for the GPU memory itself (otherwise I would expect some errors
during the memtestG80 tests); maybe the overheated memory just causes
overheating of some other components, "logic units" (other than those
checked by memtestG80), and overheating of these units is critical for
some operations (e.g. those done during the FFT calculation)? But this
still does not explain the selectivity pointed out above.

Anyway, if this overheating theory has now been proven, which temperature
was found to be the "critical" one, i.e. below it the GPU works without
errors and above it the problems start?

(My guess is that it might be around the K20 working temperature, about
40°C.)
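
One way to look for such a threshold would be to log temperature and
clocks while a JAC run is in progress and see where the errors begin.
Below is only a minimal sketch in Python. It assumes nvidia-smi is on the
PATH and that the driver reports these query fields for the card (GeForce
boards may return "[Not Supported]" for some of them, which is stored as
None here). The function names parse_sample, read_gpu and log_gpu are
just illustrative, not part of any existing tool.

```python
import subprocess
import time

# Fields requested from nvidia-smi: GPU temperature (C), SM and memory clocks (MHz).
QUERY = "temperature.gpu,clocks.sm,clocks.mem"


def _to_int(field):
    """Convert one CSV field to int; return None if the driver gave no number."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_sample(line):
    """Parse one CSV line such as '80, 876, 3004' into a dict."""
    temp, sm_clock, mem_clock = (_to_int(f) for f in line.split(","))
    return {"temp_c": temp, "sm_mhz": sm_clock, "mem_mhz": mem_clock}


def read_gpu(index=0):
    """Ask nvidia-smi for the current temperature and clocks of one GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_sample(out.strip().splitlines()[0])


def log_gpu(index=0, interval_s=5.0, samples=10):
    """Print a timestamped sample every interval_s seconds."""
    for _ in range(samples):
        s = read_gpu(index)
        print(time.strftime("%H:%M:%S"), s["temp_c"], s["sm_mhz"], s["mem_mhz"])
        time.sleep(interval_s)
```

Calling log_gpu(index=0) in a second terminal during the Amber job would
give a simple temperature/clock trace to compare against the step at which
the run goes wrong.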




On Thu, 11 Jul 2013 15:56:12 +0200, Scott Le Grand <> wrote:

> The problem with memtest is that it just exercises memory. Memory on its
> own is usually fine. What seems to be going on is that the memory starts
> giving errors when the GPU heats up the system while number-crunching.
> I have found all sorts of GPUs, from all AMBER-compatible generations,
> that pass memtest and even run n-body, only to blow up running JAC NVE
> for 2 minutes.
> We know Titan has issues and I suspect at least some 780s will as well.
> What's interesting is the people who have gotten around this by modding
> their GPU heatsinks and downclocking memory by a factor of 2. Not for
> the faint of heart, but a lot cheaper than buying Teslas. That said, I
> suspect the future lead to contexts where Teslas better prove their
> worth running
> Scott
> On Thu, Jul 11, 2013 at 5:05 AM, ET <> wrote:
>> I never got any memtest errors with my TITANS either, so it is a
>> combination of hardware and the CUDA code IMO,
