Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Mon, 03 Jun 2013 20:47:36 +0200

Thanks, Scott!

To me that sounds like "Of course you can win the golden treasure, if you
survive Russian roulette first ..."

It seems that the difference in reliability for scientific calculations
between Teslas and "equivalent" stock GTX cards is now (with the GK110
chip) clearly bigger. I am curious how the GTX 780 will compare to the
Titans.

So let's hope that, in the worst case, downclocking the Titans might solve
the problem.

BTW, what is the working temperature of your K20c? My Titans run below
80°C (at approx. 60% fan utilization). For the older cards (GTX 680/580 ...)
this temperature should be OK, but maybe for the GK110 it is already too
high to guarantee zero "bit fluctuations".
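
As an illustration of why even a single "bit fluctuation" matters, here is
a minimal Python sketch. It is not AMBER code: the logistic map stands in
for any chaotic integration, and the names flip_bit, logistic_trajectory
and first_visible_divergence are invented for this example. It flips one
mantissa bit of the starting value (bit 29 of a float64, which has the
weight of the lowest mantissa bit of a float32) and reports when the two
trajectories visibly part ways.

```python
import struct


def flip_bit(x, bit=0):
    """Return x with one bit of its IEEE-754 float64 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def logistic_trajectory(x0, steps, r=3.9):
    """Iterate the logistic map x -> r*x*(1-x), a toy chaotic system (not MD)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def first_visible_divergence(a, b, tol=1e-3):
    """Index of the first step where the two trajectories differ by more than tol."""
    for step, (u, v) in enumerate(zip(a, b)):
        if abs(u - v) > tol:
            return step
    return None


if __name__ == "__main__":
    x0 = 0.3
    x0_flipped = flip_bit(x0, bit=29)  # assumption: mimics one float32 low-order bit
    ref = logistic_trajectory(x0, 200)
    pert = logistic_trajectory(x0_flipped, 200)
    print("initial difference:", abs(x0 - x0_flipped))
    print("first step with |diff| > 1e-3:", first_visible_divergence(ref, pert))
```

The initial difference is tiny, yet the two trajectories stop agreeing
after a modest number of steps. This shows the general mechanism only; it
says nothing about where the flipped bits on the Titans come from.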

cuFFT may be responsible for the crashes and perhaps also for some of the
irreproducibility, but the irreproducibility must have another source as
well, as the NUCLEOSOME GB test suggests, where presumably no FFT is
involved (just the real-space calculation).
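
To pin down where two repeats of a run part ways, something along these
lines could be used. This is only a sketch under assumptions: it assumes
the mdout energy records look like the NSTEP/Etot blocks quoted further
down in this thread, the function names parse_etot and first_divergence
are made up for illustration, and it compares only Etot. The sample
records in the demo are invented, not taken from any real run.

```python
import re

# One record: "NSTEP = <int>" followed (possibly on a later line) by "Etot = <value>".
# Overflowed fields are printed as asterisks in the output.
RECORD = re.compile(
    r"NSTEP\s*=\s*(\d+).*?\bEtot\s*=\s*(\*+|[-+]?\d+\.\d+)", re.DOTALL
)


def parse_etot(text):
    """Map NSTEP -> Etot; Etot is None where the field overflowed to asterisks."""
    energies = {}
    for step, value in RECORD.findall(text):
        parsed = None if value.startswith("*") else float(value)
        # Keep the first record per step, so a trailing averages block
        # that repeats the last NSTEP does not overwrite it.
        energies.setdefault(int(step), parsed)
    return energies


def first_divergence(run_a, run_b):
    """First common NSTEP at which Etot differs between two runs, else None."""
    a, b = parse_etot(run_a), parse_etot(run_b)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    round_1 = (
        " NSTEP = 1000 TIME(PS) = 2.000\n Etot = -100.0000 EKtot = 5.0000\n"
        " NSTEP = 2000 TIME(PS) = 4.000\n Etot = -101.0000 EKtot = 5.1000\n"
    )
    round_2 = (
        " NSTEP = 1000 TIME(PS) = 2.000\n Etot = -100.0000 EKtot = 5.0000\n"
        " NSTEP = 2000 TIME(PS) = 4.000\n Etot = ************** EKtot = 5.1000\n"
    )
    print("first diverging step:", first_divergence(round_1, round_2))
```

Exact comparison is deliberate here: the question in this thread is bitwise
reproducibility, so any difference in the printed digits counts.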

   So thanks for now, and please let us know when you make some
progress.


        M.



On Mon, 03 Jun 2013 20:12:04 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> Addressing Divi's two points:
>
> 1. We're trying to find a way to do this...
>
> 2. I am extremely paranoid and while I would still use the Titans for
> development and testing, I would also currently do my publishable runs on
> GK104 GPUs or K20s. Given that, if you're comfortable with
> nondeterministic execution ala GROMACS, ACEMD, and NAMD, what's going on
> here is seemingly no worse. I'm *not* comfortable with that myself and I
> intend to find a fix or workaround like we did a couple years ago with
> GTX4xx and GTX5xx. So your best strategy might just be to wait a week or
> two and see what comes of the bug hunt.
>
> Marek et al., if these GPU tests are failing on the Titans, then by all
> means return them without hesitation, but I don't think consumer-level
> GPUs are tested with the same level of rigor as Teslas. The upside is you
> get 30% better performance for 1/3 the price. The downside is that IMO you
> should carefully validate them before using them. What I'm seeing here
> looks like single-bit differences in the low-order bits that cause a tiny
> fluctuation that ultimately mushrooms and diverges the whole shebang,
> along with occasional crashes. The crashes seem to occur somewhere in
> cuFFT. I have yet to see divergence there.
>
> Scott
>
>
> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Hi,
>> so here are my NUCLEOSOME test results. All tests finished, although
>> TITAN_0/ROUND_2 ended with "****" energies (the "***" records start at
>> the 75K step, so I am surprised that the test ran to the end). All the
>> results are irreproducible (driver 319.23, Amber12 with bugfix 18
>> applied, CUDA 5.5). I will repeat it with CUDA 5.0.
>>
>> M.
>>
>> >>>>>> TITAN_0
>>
>>
>> ROUND_1
>>
>> ------------------------------------------------------------------------------
>>
>>
>>  NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.60  PRESS = 0.0
>>  Etot   =    -66843.8345  EKtot   =     19690.5156  EPtot     =    -86534.3502
>>  BOND   =      5887.3611  ANGLE   =     13673.5215  DIHED     =     16941.7678
>>  1-4 NB =      5576.6911  1-4 EEL =      1371.5924  VDWAALS   =    -13647.8461
>>  EELEC  =    -14410.1252  EGB     =   -102286.9459  RESTRAINT =       359.6331
>>  EAMBER (non-restraint) = -86893.9832
>>
>> ------------------------------------------------------------------------------
>>
>> ROUND_2
>>
>> ------------------------------------------------------------------------------
>>
>>
>>  NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = *********  PRESS = 0.0
>>  Etot   = **************  EKtot   = **************  EPtot     =   4279668.7807
>>  BOND   =        -0.0000  ANGLE   =   4681740.3488  DIHED     =     67661.6797
>>  1-4 NB =        -0.0000  1-4 EEL =        -2.0373  VDWAALS   =       244.1012
>>  EELEC  =     72548.4049  EGB     =   -542523.7166  RESTRAINT =        -0.0000
>>  EAMBER (non-restraint) = 4279668.7807
>>
>> ------------------------------------------------------------------------------
>> STARS from the 75k step ...
>>
>>
>> >>>>>> TITAN_1
>>
>>
>> ROUND_1
>>
>> ------------------------------------------------------------------------------
>>
>>
>>  NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.36  PRESS = 0.0
>>  Etot   =    -66846.8801  EKtot   =     19675.0488  EPtot     =    -86521.9289
>>  BOND   =      5760.2422  ANGLE   =     13619.8710  DIHED     =     16996.9045
>>  1-4 NB =      5645.6416  1-4 EEL =      1774.6967  VDWAALS   =    -13622.9343
>>  EELEC  =    -14168.1788  EGB     =   -102880.8089  RESTRAINT =       352.6371
>>  EAMBER (non-restraint) = -86874.5660
>>
>> ------------------------------------------------------------------------------
>>
>> ROUND_2
>>
>> ------------------------------------------------------------------------------
>>
>>
>>  NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 311.00  PRESS = 0.0
>>  Etot   =    -66874.9016  EKtot   =     19715.3633  EPtot     =    -86590.2649
>>  BOND   =      5819.0667  ANGLE   =     13683.6633  DIHED     =     16918.8596
>>  1-4 NB =      5627.0932  1-4 EEL =      1576.9564  VDWAALS   =    -13747.1032
>>  EELEC  =    -15232.3280  EGB     =   -101590.5078  RESTRAINT =       354.0348
>>  EAMBER (non-restraint) = -86944.2997
>>
>> ------------------------------------------------------------------------------
>>
>> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz>
>> wrote:
>>
>> > OK, I will try NUCLEOSOME case as well with my latest
>> > settings : (driver 319.23, Amber12 bugfix 18 applied, cuda 5.5)
>> >
>> > M.
>> >
>> >
>> >
>> >
>> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I reran the benchmarks with Amber recompiled, the latest drivers, and
>> >> the GPU in solo configuration. The results are as follows:
>> >>
>> >>
>> >> When I run the tests on GPU-00_TeaNCake:
>> >>
>> >> 1) All the tests (across 2x repeats) finish successfully:
>> >>
>> >>
>> >> 2) The sdiff logs indicate that reproducibility across the two repeats
>> >> is as follows:
>> >>
>> >> GB_myoglobin: Reproducible across 1,000,000 steps
>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards. Also
>> >>   outfile is not written properly - blank gaps appear where something
>> >>   should have been written.
>> >> GB_TRPCage: Reproducible across 1,000,000 steps
>> >>
>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
>> >>   onwards. Also outfile is not written properly - blank gaps appear
>> >>   where something should have been written.
>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
>> >>   onwards. Also outfile is not written properly - blank gaps appear
>> >>   where something should have been written.
>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
>> >> PME_Cellulose_production_NPT: No reproducibility shown from step
>> >>   17,000 onwards. Also outfile is not written properly - blank gaps
>> >>   appear where something should have been written.
>> >>
>> >> #################################################
>> >>
>> >>
>> >> So it looks like the problem does occur in GB runs too. However, I
>> >> notice that running in single-GPU mode seems to make the problem
>> >> appear much later than it does with dual GPUs, though obviously this
>> >> is quite qualitative and based on only 1 repeat.
>> >>
>> >> br,
>> >> g
>> >>
>> >>
>> >>
>> >>
>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
>> >>
>> >>> Hi Marek,
>> >>>
>> >>> I think what you say about Valley and Heaven is true to a certain
>> >>> extent, but I think the EVGA overclock utility & MSI Kombustor that I
>> >>> linked to are very good ways of testing the card. I don't know the
>> >>> details of memtestG80 and cuda_memtest, but it seems to me that they
>> >>> test one very specific component, i.e. the memory. As the graphics
>> >>> card consists of more than this, it is better to have a test that
>> >>> checks the card in a more holistic manner IMO. :)
>> >>>
>> >>> I think this argument is supported by the fact that tech support at
>> >>> the store used a program called FurMark to stress test the GPU. As the
>> >>> GPU I returned kept failing the benchmark, they realized in less than
>> >>> half a day that it was faulty, whilst I wasted a couple of days mucking
>> >>> about with GPU memory tests using Gpuburn on Linux.
>> >>>
>> >>> http://www.ozone3d.net/benchmarks/fur/
>> >>>
>> >>> I think if you are going to test on Windows, you are better off
>> >>> getting MSI Kombustor, which I posted earlier. It contains the test
>> >>> from FurMark and many additional tests that exercise the compute
>> >>> capability of the card.
>> >>>
>> >>> best regards,
>> >>> g
>> >>>
>> >> _______________________________________________
>> >> AMBER mailing list
>> >> AMBER.ambermd.org
>> >> http://lists.ambermd.org/mailman/listinfo/amber
>> >>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 03 2013 - 12:30:03 PDT