Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Tue, 04 Jun 2013 06:14:28 +0200

Hi Scott,

I am sending my very first tests/table again (see attached). There I also
ran GTX 580/GTX 680 tests as a control and, as you can see, I obtained
perfect reproducibility on those GTX cards but also on my second TITAN
card (TITAN_1) for NUCLEOSOME! But that was with driver 319.17 (and also
before bugfix 18).

Now I will run it on my Titans again with ntpr=1, as you asked
(driver 319.23, Amber12 bugfix 18 applied, cuda 5.0).

Simultaneously I will repeat this test on GTX 580 with ntpr=1000
(driver 319.23, Amber12 bugfix 18 applied, cuda 5.0).

BTW I also experimented a bit. First I took some settings from
NUCLEOSOME (e.g. igb=5, ntt=1/3, saltcon=0.1, tautp=1.0 + restraints) and
used them for TRP cage and myoglobin, assuming that these parameters,
which differ between NUCLEOSOME and TRP + MYO, would affect the
TRP + MYO reproducibility.

This was not confirmed, i.e. TRP + MYO were still perfectly reproducible.

So then (to be sure) I did the opposite experiment and used the TRP mdin
file for NUCLEOSOME to see whether it influences the NUCLEOSOME
reproducibility. In agreement with the "TRP-MYO" tests, NUCLEOSOME was
again irreproducible ...
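
For anyone repeating these comparisons by hand, here is a minimal Python
sketch of the kind of check discussed in this thread: it reads the text of
two mdout files and reports the first step at which the printed Etot
differs. The function names (etot_by_step, first_divergence) are only
illustrative and are not part of Amber; the parsing assumes energy records
of the "NSTEP = ..." / "Etot = ..." form quoted in this thread. Values are
compared as printed strings, so "****" overflow records also count as a
difference.

```python
import re

# Step counter and total energy as printed in the energy records
# quoted in this thread ("NSTEP = ..." followed by "Etot = ...").
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(\S+)")


def etot_by_step(mdout_text):
    """Return {step: printed Etot} for each energy record in the text.

    Only the first record per step is kept, so a later summary block
    that repeats the final step number does not overwrite it.
    """
    energies = {}
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
            continue
        match = ETOT_RE.search(line)
        if match and step is not None:
            energies.setdefault(step, match.group(1))
            step = None
    return energies


def first_divergence(text_a, text_b):
    """Return the first common step whose printed Etot differs, else None."""
    a = etot_by_step(text_a)
    b = etot_by_step(text_b)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None


# Step-100000 totals quoted in this thread for TITAN_1, ROUND_1 and ROUND_2.
round_1 = "NSTEP = 100000 TIME(PS) = 300.000\n Etot = -66846.8801 EKtot = 19675.0488\n"
round_2 = "NSTEP = 100000 TIME(PS) = 300.000\n Etot = -66874.9016 EKtot = 19715.3633\n"
print(first_divergence(round_1, round_2))  # prints 100000
```

With ntpr=1 every step is printed, so the same function would point at the
exact step where two runs part ways; with ntpr=1000 it can only narrow it
down to a 1000-step window.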

So let's see the ntpr tests.

   M.




On Tue, 04 Jun 2013 04:51:08 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> Update: The nucleosome GB irreproducibility is weird. It goes away on my
> Titan if I set ntpr to 1 (I was trying to find the offending energy
> component that diverges first). Can you guys try this on your machines?
> I think this might be SW...
>
>
>
>
>
>
> On Mon, Jun 3, 2013 at 1:18 PM, ET <sketchfoot.gmail.com> wrote:
>
>> Hi Scott & Ross,
>>
>> I take it you will post to this thread once a fix has been found? :)
>>
>> br,
>> g
>>
>>
>> On 3 June 2013 20:31, Marek Maly <marek.maly.ujep.cz> wrote:
>>
>> > OK,
>> > I just took a deep breath and started to pray :))
>> >
>> > BTW, the difference between the GB results for TRPcage/myoglobin
>> > (perfectly reproducible) and Nucleosome (irreproducible) might be
>> > connected with some differences in the mdin parameters:
>> >
>> > TRPcage/myoglobin (igb=1, ntt=3) versus Nucleosome (igb=5, ntt=1).
>> > The Nucleosome simulation also uses a restraint:
>> >
>> > RESTRAIN DNA
>> > 0.1
>> > RES 1 294
>> > END
>> > END
>> >
>> > I will experiment here to learn which parameter is responsible for
>> > the irreproducible Nucleosome results.
>> >
>> > M.
>> >
>> >
>> >
>> >
>> >
>> > On Mon, 03 Jun 2013 21:17:23 +0200, Ross Walker <ross.rosswalker.co.uk> wrote:
>> >
>> > > Hi Marek,
>> > >
>> > > To be honest I would just take a deep breath and give us some time to
>> > > figure out what is going on with the Titan and work around it.
>> > > Hopefully this won't take too long and we can have a patch out shortly.
>> > >
>> > > All the best
>> > > Ross
>> > >
>> > >
>> > >
>> > > On 6/3/13 11:47 AM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>> > >
>> > >> Thanks Scott!
>> > >>
>> > >> To me it sounds like "Of course you can win the gold treasure, if you
>> > >> survive Russian roulette first ..."
>> > >>
>> > >> It seems that the difference in reliability for scientific
>> > >> calculations between Teslas and "equivalent" stock GTXs is now (with
>> > >> the GK110 chip) clearly bigger. I am curious how the GTX 780 will
>> > >> compare to the Titans.
>> > >>
>> > >> So let's hope that in the worst case downclocking the Titans might
>> > >> solve the problem.
>> > >>
>> > >> BTW what is the working temperature of your K20c? My Titans run
>> > >> under 80°C (ca. 60% fan utilization). For the older cards (GTX
>> > >> 680/580 ...) this temperature should be OK, but maybe for the GK110
>> > >> it is already too high to ensure zero "bit fluctuations".
>> > >>
>> > >> cuFFT is maybe responsible for the crashes and maybe also for some of
>> > >> the irreproducibility, but the irreproducibility must also have
>> > >> another source, as suggested by the NUCLEOSOME GB test, where perhaps
>> > >> no FFT is involved (just the real-space calculation)?
>> > >>
>> > >> So thanks for the moment and please let us know when you make some
>> > >> progress.
>> > >>
>> > >>
>> > >> M.
>> > >>
>> > >>
>> > >>
>> > >> On Mon, 03 Jun 2013 20:12:04 +0200, Scott Le Grand
>> > >> <varelse2005.gmail.com> wrote:
>> > >>
>> > >>> Addressing Divi's two points:
>> > >>>
>> > >>> 1. We're trying to find a way to do this...
>> > >>>
>> > >>> 2. I am extremely paranoid and while I would still use the Titans for
>> > >>> development and testing, I would also currently do my publishable
>> > >>> runs on GK104 GPUs or K20s. Given that, if you're comfortable with
>> > >>> nondeterministic execution a la GROMACS, ACEMD, and NAMD, what's
>> > >>> going on here is seemingly no worse. I'm *not* comfortable with that
>> > >>> myself and I intend to find a fix or workaround like we did a couple
>> > >>> of years ago with GTX4xx and GTX5xx. So your best strategy might just
>> > >>> be to wait a week or two and see what comes of the bug hunt.
>> > >>>
>> > >>> Marek et al., if these GPU tests are failing on the Titans, then by
>> > >>> all means return them without hesitation, but I don't think
>> > >>> consumer-level GPUs are tested with the same level of rigor as
>> > >>> Teslas. The upside is you get 30% better performance for 1/3 the
>> > >>> price. The downside is that IMO you should carefully validate them
>> > >>> before using them. What I'm seeing here looks like single-bit
>> > >>> differences in the low-order bits that cause a tiny fluctuation that
>> > >>> ultimately mushrooms and diverges the whole shebang, along with
>> > >>> occasional crashes. The crashes seem to occur in cuFFT somewhere. I
>> > >>> have yet to see divergence there.
>> > >>>
>> > >>> Scott
>> > >>>
>> > >>>
>> > >>> On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>> > >>>
>> > >>>> Hi,
>> > >>>> so here are my NUCLEOSOME test results. All tests finished, although
>> > >>>> TITAN_0/ROUND_2 finished with "****" energies (the *** records start
>> > >>>> from the 75K step, so it surprises me that the test still ran to the
>> > >>>> end). All the results are irreproducible (driver 319.23, Amber12
>> > >>>> bugfix 18 applied, cuda 5.5). I will repeat it with CUDA 5.0.
>> > >>>>
>> > >>>> M.
>> > >>>>
>> > >>>> >>>>>> TITAN_0
>> > >>>>
>> > >>>> ROUND_1
>> > >>>>
>> > >>>> ------------------------------------------------------------------------------
>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.60  PRESS = 0.0
>> > >>>> Etot   =    -66843.8345  EKtot   =     19690.5156  EPtot     =    -86534.3502
>> > >>>> BOND   =      5887.3611  ANGLE   =     13673.5215  DIHED     =     16941.7678
>> > >>>> 1-4 NB =      5576.6911  1-4 EEL =      1371.5924  VDWAALS   =    -13647.8461
>> > >>>> EELEC  =    -14410.1252  EGB     =   -102286.9459  RESTRAINT =       359.6331
>> > >>>> EAMBER (non-restraint) = -86893.9832
>> > >>>> ------------------------------------------------------------------------------
>> > >>>>
>> > >>>> ROUND_2
>> > >>>>
>> > >>>> ------------------------------------------------------------------------------
>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = *********  PRESS = 0.0
>> > >>>> Etot   = **************  EKtot   = **************  EPtot     =   4279668.7807
>> > >>>> BOND   =        -0.0000  ANGLE   =   4681740.3488  DIHED     =     67661.6797
>> > >>>> 1-4 NB =        -0.0000  1-4 EEL =        -2.0373  VDWAALS   =       244.1012
>> > >>>> EELEC  =     72548.4049  EGB     =   -542523.7166  RESTRAINT =        -0.0000
>> > >>>> EAMBER (non-restraint) = 4279668.7807
>> > >>>> ------------------------------------------------------------------------------
>> > >>>> STARS from the 75k step ...
>> > >>>>
>> > >>>>
>> > >>>> >>>>>> TITAN_1
>> > >>>>
>> > >>>> ROUND_1
>> > >>>>
>> > >>>> ------------------------------------------------------------------------------
>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 310.36  PRESS = 0.0
>> > >>>> Etot   =    -66846.8801  EKtot   =     19675.0488  EPtot     =    -86521.9289
>> > >>>> BOND   =      5760.2422  ANGLE   =     13619.8710  DIHED     =     16996.9045
>> > >>>> 1-4 NB =      5645.6416  1-4 EEL =      1774.6967  VDWAALS   =    -13622.9343
>> > >>>> EELEC  =    -14168.1788  EGB     =   -102880.8089  RESTRAINT =       352.6371
>> > >>>> EAMBER (non-restraint) = -86874.5660
>> > >>>> ------------------------------------------------------------------------------
>> > >>>>
>> > >>>> ROUND_2
>> > >>>>
>> > >>>> ------------------------------------------------------------------------------
>> > >>>> NSTEP = 100000  TIME(PS) = 300.000  TEMP(K) = 311.00  PRESS = 0.0
>> > >>>> Etot   =    -66874.9016  EKtot   =     19715.3633  EPtot     =    -86590.2649
>> > >>>> BOND   =      5819.0667  ANGLE   =     13683.6633  DIHED     =     16918.8596
>> > >>>> 1-4 NB =      5627.0932  1-4 EEL =      1576.9564  VDWAALS   =    -13747.1032
>> > >>>> EELEC  =    -15232.3280  EGB     =   -101590.5078  RESTRAINT =       354.0348
>> > >>>> EAMBER (non-restraint) = -86944.2997
>> > >>>> ------------------------------------------------------------------------------
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz>
>> > >>>> wrote:
>> > >>>>
>> > >>>> > OK, I will try the NUCLEOSOME case as well with my latest
>> > >>>> > settings (driver 319.23, Amber12 bugfix 18 applied, cuda 5.5).
>> > >>>> >
>> > >>>> > M.
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
>> > >>>> >
>> > >>>> >> Hi all,
>> > >>>> >>
>> > >>>> >> I reran the benchmark with Amber recompiled and the latest
>> > >>>> >> drivers. With the GPU in solo configuration it yields the
>> > >>>> >> following results:
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> When I run the tests on GPU-00_TeaNCake:
>> > >>>> >>
>> > >>>> >> 1) All the tests (across 2x repeats) finish successfully:
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> 2) The sdiff logs indicate that reproducibility across the two
>> > >>>> >> repeats is as follows:
>> > >>>> >>
>> > >>>> >> GB_myoglobin: Reproducible across 1,000,000 steps
>> > >>>> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards.
>> > >>>> >> Also outfile is not written properly - blank gaps appear where
>> > >>>> >> something should have been written.
>> > >>>> >> GB_TRPCage: Reproducible across 1,000,000 steps
>> > >>>> >>
>> > >>>> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
>> > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
>> > >>>> >> where something should have been written.
>> > >>>> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
>> > >>>> >> onwards. Also outfile is not written properly - blank gaps appear
>> > >>>> >> where something should have been written.
>> > >>>> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
>> > >>>> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
>> > >>>> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
>> > >>>> >> PME_Cellulose_production_NPT: No reproducibility shown from step
>> > >>>> >> 17,000 onwards. Also outfile is not written properly - blank gaps
>> > >>>> >> appear where something should have been written.
>> > >>>> >>
>> > >>>> >> #################################################
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> So it looks like the problem does occur in GB runs too. I notice,
>> > >>>> >> though, that running in single GPU mode seems to make the problem
>> > >>>> >> appear much later than it does with dual GPUs, although obviously
>> > >>>> >> this is quite qualitative and based on only 1 repeat.
>> > >>>> >>
>> > >>>> >> br,
>> > >>>> >> g
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
>> > >>>> >>
>> > >>>> >>> Hi Marek,
>> > >>>> >>>
>> > >>>> >>> I think what you say about Valley and Heaven is true to a
>> > >>>> >>> certain extent, but I think the links I posted to the EVGA
>> > >>>> >>> overclock utility & MSI Kombustor are very good ways of testing
>> > >>>> >>> the card. I don't know the details of memtestG80 and
>> > >>>> >>> cuda_memtest, but it seems to me that they are testing one very
>> > >>>> >>> specific component, i.e. the memory. As the graphics card
>> > >>>> >>> consists of more than this, it is better to have a test that
>> > >>>> >>> checks the card in a more holistic manner IMO. :)
>> > >>>> >>>
>> > >>>> >>> I think this argument is supported by the fact that tech support
>> > >>>> >>> at the store used a program called FurMark to stress test the
>> > >>>> >>> GPU. As the GPU I returned kept failing the benchmark, they
>> > >>>> >>> realized in less than half a day it was faulty, whilst I wasted
>> > >>>> >>> a couple of days mucking about with GPU memory tests using
>> > >>>> >>> Gpuburn on Linux.
>> > >>>> >>>
>> > >>>> >>> http://www.ozone3d.net/benchmarks/fur/
>> > >>>> >>>
>> > >>>> >>> I think if you are going to test on Windows, you are better off
>> > >>>> >>> getting MSI Kombustor, which I posted earlier. It contains the
>> > >>>> >>> test contained in FurMark and many additional tests that test
>> > >>>> >>> the compute capability of the card.
>> > >>>> >>>
>> > >>>> >>> best regards,
>> > >>>> >>> g
>> > >>>> >>>
>> > >>>> >> _______________________________________________
>> > >>>> >> AMBER mailing list
>> > >>>> >> AMBER.ambermd.org
>> > >>>> >> http://lists.ambermd.org/mailman/listinfo/amber
>> > >>>> >>
>> > >>>> >> __________ Information from ESET NOD32 Antivirus, virus
>> > >>>> >> signature database version 8405 (20130603) __________
>> > >>>> >>
>> > >>>> >> This message was checked by ESET NOD32 Antivirus.
>> > >>>> >>
>> > >>>> >> http://www.eset.cz


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/



TITANs - ns/day

GPU_0    JAC_NVE  JAC_NPT  FACTOR_IX_NVE  FACTOR_IX_NPT  CELLULOSE_NVE  TRPCAGE  MYOGLOBIN  NUCLEOSOME
ROUND_1  115.91   ERR      30.56          25.01          ERR            595.09   202.56     3.45
ROUND_2  109.41   85.73    30.27          24.95          ERR            623.96   201.16     3.45
GPU_1
ROUND_1  114.92   85.97    29.85          24.56          ERR            599.20   195.91     3.40
ROUND_2  106.44   83.63    29.63          24.43          7.05           585.14   197.48     3.40


Total energy at step 100000

TITAN_0  JAC_NVE      JAC_NPT      FACTOR_IX_NVE  FACTOR_IX_NPT  CELLULOSE_NVE  TRPCAGE    MYOGLOBIN   NUCLEOSOME
ROUND_1  -58137.8526  ERR          -234189.5802   -234370.3688   ERR            -238.0523  -1429.6137  -66858.7444
ROUND_2  -58140.5142  -58159.9873  -234189.5802   -234370.3688   ERR            -238.0523  -1429.6137  -66792.2804
TITAN_1
ROUND_1  -58139.8792  -58147.8714  -234189.5802   -234370.3688   ERR            -238.0523  -1429.6137  -66858.7444
ROUND_2  -58141.8652  -58150.9792  -234189.5802   -234370.3688   -443246.3206   -238.0523  -1429.6137  -66858.7444
GTX_680
ROUND_1  -58139.7224  -58190.8157  -234184.6576   -234360.2490   -443246.3519   -238.0523  -1429.6137  -66841.1887
ROUND_2  -58139.7224  -58190.8157  -234184.6576   -234360.2490   -443246.3519   -238.0523  -1429.6137  -66841.1887
GTX_580
ROUND_1  -58139.8773  -58158.3432  -234186.3908   -234391.0005   -443246.3519   -242.7692  -1366.9785  -66801.3274
ROUND_2  -58139.8773  -58158.3432  -234186.3908   -234391.0005   -443246.3519   -242.7692  -1366.9785  -66801.3274
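
To make the table above easier to scan, the round-to-round comparison can
be sketched in a few lines of Python. This is only an illustration: the
names BENCHMARKS, ROUNDS and classify are made up here, and the numbers are
copied from the "Total energy at step 100000" table. Energies are compared
as printed strings, and a benchmark counts as an error if either round
shows ERR.

```python
BENCHMARKS = ["JAC_NVE", "JAC_NPT", "FACTOR_IX_NVE", "FACTOR_IX_NPT",
              "CELLULOSE_NVE", "TRPCAGE", "MYOGLOBIN", "NUCLEOSOME"]

# (ROUND_1, ROUND_2) total energies at step 100000, copied from the table.
ROUNDS = {
    "TITAN_0": (
        "-58137.8526 ERR -234189.5802 -234370.3688 ERR -238.0523 -1429.6137 -66858.7444",
        "-58140.5142 -58159.9873 -234189.5802 -234370.3688 ERR -238.0523 -1429.6137 -66792.2804",
    ),
    "TITAN_1": (
        "-58139.8792 -58147.8714 -234189.5802 -234370.3688 ERR -238.0523 -1429.6137 -66858.7444",
        "-58141.8652 -58150.9792 -234189.5802 -234370.3688 -443246.3206 -238.0523 -1429.6137 -66858.7444",
    ),
    "GTX_680": (
        "-58139.7224 -58190.8157 -234184.6576 -234360.2490 -443246.3519 -238.0523 -1429.6137 -66841.1887",
        "-58139.7224 -58190.8157 -234184.6576 -234360.2490 -443246.3519 -238.0523 -1429.6137 -66841.1887",
    ),
    "GTX_580": (
        "-58139.8773 -58158.3432 -234186.3908 -234391.0005 -443246.3519 -242.7692 -1366.9785 -66801.3274",
        "-58139.8773 -58158.3432 -234186.3908 -234391.0005 -443246.3519 -242.7692 -1366.9785 -66801.3274",
    ),
}


def classify(round_1, round_2):
    """Label each benchmark as 'reproducible', 'differs' or 'error'."""
    verdict = {}
    for name, a, b in zip(BENCHMARKS, round_1.split(), round_2.split()):
        if "ERR" in (a, b):
            verdict[name] = "error"
        elif a == b:
            verdict[name] = "reproducible"
        else:
            verdict[name] = "differs"
    return verdict


for card, (round_1, round_2) in ROUNDS.items():
    verdict = classify(round_1, round_2)
    bad = [f"{name} ({state})" for name, state in verdict.items()
           if state != "reproducible"]
    print(card, "->", ", ".join(bad) or "all reproducible")
```

Comparing strings rather than floats is deliberate: the question in this
thread is whether two runs print identical output, not whether they agree
within some tolerance.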







_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 03 2013 - 22:00:02 PDT