Addressing Divi's two points:
1. We're trying to find a way to do this...
2. I am extremely paranoid, and while I would still use the Titans for
development and testing, I would currently do my publishable runs on
GK104 GPUs or K20s. Given that, if you're comfortable with
nondeterministic execution à la GROMACS, ACEMD, and NAMD, what's going on
here is seemingly no worse. I'm *not* comfortable with that myself, and I
intend to find a fix or workaround like we did a couple of years ago with
GTX4xx and GTX5xx. So your best strategy might just be to wait a week or
two and see what comes of the bug hunt.
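To make "nondeterministic execution" concrete: one common source of run-to-run differences in parallel codes (an assumption here, not something this thread establishes for any of the packages named) is that floating-point addition is not associative, so the order in which terms are accumulated changes the low-order bits of the sum. A minimal Python sketch with illustrative values; `sum_in_order` is a hypothetical helper, and `math.ulp` needs Python 3.9 or later:

```python
import math

def sum_in_order(values):
    """Accumulate left to right in double precision, like a serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three illustrative terms, accumulated in two different orders.
terms = [0.1, 0.2, 0.3]
forward = sum_in_order(terms)
backward = sum_in_order(reversed(terms))

print(repr(forward), repr(backward))
print("identical:", forward == backward)
print("difference in units of the last place:",
      abs(forward - backward) / math.ulp(backward))

# math.fsum tracks the rounding error and returns the same correctly
# rounded result regardless of the order of the terms.
print("fsum agrees across orders:",
      math.fsum(terms) == math.fsum(reversed(terms)))
```

The two orderings disagree only in the last bits, which is the kind of discrepancy that a chaotic simulation can later amplify.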
Marek et al., if these GPU tests are failing on the Titans, then by all
means return them without hesitation, but I don't think consumer-level GPUs
are tested with the same level of rigor as Teslas. The upside is you get
30% better performance for 1/3 the price. The downside is that IMO you
should carefully validate them before using them. What I'm seeing here
looks like single-bit differences in the low-order bits that cause a tiny
fluctuation, which ultimately mushrooms and makes the whole shebang
diverge, along with occasional crashes. The crashes seem to occur
somewhere in cuFFT; I have yet to see divergence there.
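As one way to approach the validation mentioned above, the following Python sketch compares the energy records of two repeat runs and reports the first step at which they differ, or at which a field has overflowed to asterisks. It assumes the `NSTEP = ...` / `Etot = ...` layout quoted later in this thread; the function names and the sample records are hypothetical and made up purely for illustration.

```python
import re

# Field patterns for the energy records as they appear in this thread.
# \S+ also captures a field that has overflowed to asterisks.
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(\S+)")

def energy_records(text):
    """Return (nstep, etot_string) pairs from mdout-style text."""
    records, step = [], None
    for line in text.splitlines():
        m = NSTEP_RE.search(line)
        if m:
            step = int(m.group(1))
        m = ETOT_RE.search(line)
        if m and step is not None:
            records.append((step, m.group(1)))
            step = None
    return records

def first_divergence(run_a, run_b):
    """First step whose Etot text differs between two runs, else None."""
    for (step_a, e_a), (step_b, e_b) in zip(energy_records(run_a),
                                            energy_records(run_b)):
        if step_a != step_b or e_a != e_b:
            return step_a
    return None

def first_overflow(run):
    """First step whose Etot field printed as asterisks, else None."""
    for step, etot in energy_records(run):
        if "*" in etot:
            return step
    return None

# Made-up records for illustration only; not taken from any real run.
run_a = """\
 NSTEP = 1000 TIME(PS) = 2.000 TEMP(K) = 300.00 PRESS = 0.0
 Etot = -100.0000 EKtot = 50.0000 EPtot = -150.0000
 NSTEP = 2000 TIME(PS) = 4.000 TEMP(K) = 300.10 PRESS = 0.0
 Etot = -100.0010 EKtot = 50.0010 EPtot = -150.0020
"""
run_b = run_a.replace("-100.0010", "-100.0011")
run_c = run_a.replace("-100.0010", "**************")

print(first_divergence(run_a, run_a))  # None: identical text
print(first_divergence(run_a, run_b))
print(first_overflow(run_c))
```

The comparison is on the printed text, so it flags any difference in the last printed digit, much like a line-by-line diff of the two output files; a tolerance-based comparison would be a different design choice.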
Scott
On Mon, Jun 3, 2013 at 9:42 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> Hi,
> here are my NUCLEOSOME test results. All tests finished, although
> TITAN_0/ROUND_2 finished with "****" energies (the "***" records start at
> step 75K, so I am surprised that the test ran to the end). None of the
> results are reproducible (driver 319.23, Amber12 with bugfix 18 applied,
> CUDA 5.5). I will repeat the test with CUDA 5.0.
>
> M.
>
> >>>>>> TITAN_0
>
>
> ROUND_1
>
> ------------------------------------------------------------------------------
>
>
> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 310.60 PRESS = 0.0
> Etot = -66843.8345 EKtot = 19690.5156 EPtot = -86534.3502
> BOND = 5887.3611 ANGLE = 13673.5215 DIHED = 16941.7678
> 1-4 NB = 5576.6911 1-4 EEL = 1371.5924 VDWAALS = -13647.8461
> EELEC = -14410.1252 EGB = -102286.9459 RESTRAINT = 359.6331
> EAMBER (non-restraint) = -86893.9832
>
> ------------------------------------------------------------------------------
>
> ROUND_2
>
> ------------------------------------------------------------------------------
>
>
> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) =********* PRESS = 0.0
> Etot = ************** EKtot = ************** EPtot = 4279668.7807
> BOND = -0.0000 ANGLE = 4681740.3488 DIHED = 67661.6797
> 1-4 NB = -0.0000 1-4 EEL = -2.0373 VDWAALS = 244.1012
> EELEC = 72548.4049 EGB = -542523.7166 RESTRAINT = -0.0000
> EAMBER (non-restraint) = 4279668.7807
>
> ------------------------------------------------------------------------------
> Asterisks appear from step 75K onwards ...
>
>
> >>>>>> TITAN_1
>
>
> ROUND_1
>
> ------------------------------------------------------------------------------
>
>
> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 310.36 PRESS = 0.0
> Etot = -66846.8801 EKtot = 19675.0488 EPtot = -86521.9289
> BOND = 5760.2422 ANGLE = 13619.8710 DIHED = 16996.9045
> 1-4 NB = 5645.6416 1-4 EEL = 1774.6967 VDWAALS = -13622.9343
> EELEC = -14168.1788 EGB = -102880.8089 RESTRAINT = 352.6371
> EAMBER (non-restraint) = -86874.5660
>
> ------------------------------------------------------------------------------
>
> ROUND_2
>
> ------------------------------------------------------------------------------
>
>
> NSTEP = 100000 TIME(PS) = 300.000 TEMP(K) = 311.00 PRESS = 0.0
> Etot = -66874.9016 EKtot = 19715.3633 EPtot = -86590.2649
> BOND = 5819.0667 ANGLE = 13683.6633 DIHED = 16918.8596
> 1-4 NB = 5627.0932 1-4 EEL = 1576.9564 VDWAALS = -13747.1032
> EELEC = -15232.3280 EGB = -101590.5078 RESTRAINT = 354.0348
> EAMBER (non-restraint) = -86944.2997
>
> ------------------------------------------------------------------------------
>
>
>
>
>
>
>
>
> On Mon, 03 Jun 2013 12:34:15 +0200, Marek Maly <marek.maly.ujep.cz> wrote:
>
> > OK, I will try the NUCLEOSOME case as well with my latest
> > settings (driver 319.23, Amber12 with bugfix 18 applied, CUDA 5.5).
> >
> > M.
> >
> >
> >
> >
> > On Mon, 03 Jun 2013 11:51:46 +0200, ET <sketchfoot.gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> I reran the benchmark with Amber recompiled, the latest drivers
> >> installed, and the GPU in a solo configuration. The results are as
> >> follows:
> >>
> >>
> >> When I run the tests on GPU-00_TeaNCake:
> >>
> >> 1) All the tests (across 2 repeats) finish successfully.
> >>
> >>
> >> 2) The sdiff logs indicate that reproducibility across the two repeats
> >> is as follows:
> >>
> >> GB_myoglobin: Reproducible across 1,000,000 steps
> >> GB_nucleosome: No reproducibility shown from step 3,400 onwards. Also,
> >> the outfile is not written properly: blank gaps appear where something
> >> should have been written.
> >> GB_TRPCage: Reproducible across 1,000,000 steps
> >>
> >> PME_JAC_production_NVE: No reproducibility shown from step 35,000
> >> onwards. Also, the outfile is not written properly: blank gaps appear
> >> where something should have been written.
> >> PME_JAC_production_NPT: No reproducibility shown from step 69,000
> >> onwards. Also, the outfile is not written properly: blank gaps appear
> >> where something should have been written.
> >> PME_FactorIX_production_NVE: Reproducible across 100k steps
> >> PME_FactorIX_production_NPT: Reproducible across 100k steps
> >> PME_Cellulose_production_NVE: Reproducible across 100k steps
> >> PME_Cellulose_production_NPT: No reproducibility shown from step 17,000
> >> onwards. Also, the outfile is not written properly: blank gaps appear
> >> where something should have been written.
> >>
> >> #################################################
> >>
> >>
> >> So it looks like the problem does occur in GB runs too. I notice,
> >> though, that running in single-GPU mode seems to make the problem
> >> appear much later than it does with dual GPUs, although obviously this
> >> is quite qualitative and based on only 1 repeat.
> >>
> >> br,
> >> g
> >>
> >>
> >>
> >>
> >> On 3 June 2013 10:28, ET <sketchfoot.gmail.com> wrote:
> >>
> >>> Hi Marek,
> >>>
> >>> I think what you say about Valley and Heaven is true to a certain
> >>> extent, but I think the links I posted to the EVGA overclock utility &
> >>> MSI Kombustor are very good ways of testing the card. I don't know the
> >>> details of memtestG80 and cuda_memtest, but it seems to me that they
> >>> test one very specific component, i.e. the memory. As the graphics card
> >>> consists of more than this, it is better to have a test that checks the
> >>> card in a more holistic manner IMO. :)
> >>>
> >>> I think this argument is supported by the fact that tech support at the
> >>> store used a program called FurMark to stress test the GPU. As the GPU
> >>> I returned kept failing the benchmark, they realized in less than half
> >>> a day that it was faulty, whilst I wasted a couple of days mucking
> >>> about with GPU memory tests using Gpuburn on Linux.
> >>>
> >>> http://www.ozone3d.net/benchmarks/fur/
> >>>
> >>> I think if you are going to test on Windows, you are better off getting
> >>> MSI Kombustor, which I posted earlier. It contains the test from
> >>> FurMark and many additional tests that exercise the compute capability
> >>> of the card.
> >>>
> >>> best regards,
> >>> g
> >>>
> >> _______________________________________________
> >> AMBER mailing list
> >> AMBER.ambermd.org
> >> http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >
> >
>
>
>
Received on Mon Jun 03 2013 - 11:30:03 PDT