Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 6 Jun 2013 07:23:30 -0700

My *suspicion* is that the GB issue and the FFT issue have the same root
cause. The difference is that the GB issue happens in my own code, so I'm
continuing to investigate; since the FFT code is NVIDIA's, all I could do
was create a repro app and hand it off to them. We'll find a fix
and/or a workaround; we always do...

Scott




On Thu, Jun 6, 2013 at 3:40 AM, Marek Maly <marek.maly.ujep.cz> wrote:

> Welcome to the club :))
>
> First of all, do not panic. Scott recently identified a cuFFT "bug" in
> connection with Titans and reported it to NVIDIA; now we have to wait for
> the NVIDIA experts' answer. There is also another Amber/Titan issue
> with a different origin (GB simulations of big systems, i.e. the NUCLEOSOME),
> which you may try as well. The Amber guys are perhaps working on that too.
>
> So in your place I would wait with the RMA unless you have any other
> indications that your GPU might be damaged. In the meantime you can run
> some tests of this GPU with memtestG80.
>
> here is the most recent version:
>
> ---
> memtestG80
> https://github.com/ihaque/memtestG80
> here is the sync fix code
>
> https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
> ---
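>
> If it helps, something like this little Python sketch could loop memtestG80
> over both cards (just my untested suggestion; the --gpu flag and the
> <MiB> <iterations> positional arguments should be checked against
> ./memtestG80 --help):
>
> ---
> # Sketch: loop memtestG80 over each GPU and save the output to a log file.
> # Assumes the binary takes "--gpu <id> <MiB> <iterations>"; verify with --help.
> import subprocess
>
> for gpu in (0, 1):                      # test both Titans, one after the other
>     cmd = ["./memtestG80", "--gpu", str(gpu), "4096", "100"]
>     with open("memtestG80_gpu%d.log" % gpu, "w") as log:
>         rc = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
>     print("GPU %d finished with exit code %d" % (gpu, rc))
> ---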
>
> BTW, which Titan GPU are you using, the stock one or the superclocked one?
>
> Anyway, I would recommend recompiling Amber with the latest
> Amber 12 patch (bugfix 18) if you have not done so already.
>
> M.
>
>
>
> On Thu, 06 Jun 2013 12:01:35 +0200, Jonathan Gough
> <jonathan.d.gough.gmail.com> wrote:
>
> > Bad News.
> >
> > I ran each set of tests 4 times with nstlim=100000. FactorIX was the only
> > one that gave consistent results. Again, I had a few runs that just died
> > without any error messages.
> >
> > CentOS 6
> > GNU compilers
> > CUDA 5.0, driver version 319.23
> > AmberTools version 13.09
> > Amber version 12.18
> >
> > Cellulose_production_NVE/1/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
> > Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
> > Cellulose_production_NVE/3/mdout: Etot = -443238.0345  EKtot = 257651.0625  EPtot = -700889.0970
> > Cellulose_production_NVE/4/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
> >
> > Cellulose_production_NPT/1/mdout: Etot = -441009.1612  EKtot = 257571.2031  EPtot = -698580.3643
> > Cellulose_production_NPT/2/mdout: Etot = -440947.3717  EKtot = 257723.3750  EPtot = -698670.7467
> > Cellulose_production_NPT/3/mdout: Etot = -441024.3259  EKtot = 257406.5781  EPtot = -698430.9041
> > Cellulose_production_NPT/4/mdout: Etot = -440970.6005  EKtot = 257756.1250  EPtot = -698726.7255
> >
> > FactorIX_production_NVE/1/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> > FactorIX_production_NVE/2/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> > FactorIX_production_NVE/3/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> > FactorIX_production_NVE/4/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> >
> > FactorIX_production_NPT/1/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> > FactorIX_production_NPT/2/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> > FactorIX_production_NPT/3/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> > FactorIX_production_NPT/4/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> >
> > JAC_production_NVE/1/mdout: Etot = -58141.0647  EKtot = 14347.6699  EPtot = -72488.7346
> > JAC_production_NVE/2/mdout: Etot = -58141.4961  EKtot = 14320.1465  EPtot = -72461.6425
> > JAC_production_NVE/3/mdout: Died at 48000 steps
> > JAC_production_NVE/4/mdout: Etot = -58141.6938  EKtot = 14257.2305  EPtot = -72398.9243
> >
> > JAC_production_NPT/1/mdout: Died at 78000 steps
> > JAC_production_NPT/2/mdout: Etot = -58206.6103  EKtot = 14384.7959  EPtot = -72591.4062
> > JAC_production_NPT/3/mdout: Etot = -58211.2469  EKtot = 14454.1592  EPtot = -72665.4061
> > JAC_production_NPT/4/mdout: Died at 89000 steps
> >
> >
> > Any recommendations on what to do? Send the card back? Update drivers?
> > Update CUDA?
> >
> >
> >
> >
> > On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> >> Yes, you got it.
> >>
> >> One more thing: check the benchmark mdin files carefully, and if you see
> >> "ig=-1" there, just delete it, to ensure that both runs of a given test
> >> use the same random seed.
> >>
> >> (As I remember, I found it in just one or two of the tests; I don't
> >> remember which ones.)
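> >>
> >> If it helps, here is a small untested sketch that strips "ig=-1" from the
> >> mdin files (the glob pattern is just an assumption about the directory
> >> layout, adjust it to yours):
> >>
> >> ---
> >> # Sketch: remove any "ig=-1" entry from every mdin file under the benchmark
> >> # directories, so repeated runs reuse the default (fixed) random seed.
> >> import glob, re
> >>
> >> for path in glob.glob("*/mdin"):             # adjust the pattern to your layout
> >>     with open(path) as f:
> >>         text = f.read()
> >>     cleaned = re.sub(r"ig\s*=\s*-1\s*,?", "", text)
> >>     if cleaned != text:
> >>         with open(path, "w") as f:
> >>             f.write(cleaned)
> >>         print("removed ig=-1 from " + path)
> >> ---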
> >>
> >> Let us know your results, i.e. whether all the tests (JAC NVE/NPT,
> >> FACTOR_IX NVE/NPT, etc.) successfully finished all 100K steps (in both
> >> runs) and, moreover, whether the results from both runs are identical
> >> (just check the final energy).
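> >>
> >> For the comparison, a quick sketch like this could print the last Etot
> >> line of each run's mdout (untested, and the paths are just placeholders):
> >>
> >> ---
> >> # Sketch: print the last "Etot" line found in each run's mdout so the two
> >> # runs of a test can be compared by eye.  Paths are placeholders.
> >> def last_etot(path):
> >>     last = None
> >>     with open(path) as f:
> >>         for line in f:
> >>             if "Etot" in line:
> >>                 last = line.strip()
> >>     return last
> >>
> >> for run in ("1", "2"):
> >>     print("run %s: %s" % (run, last_etot("JAC_production_NVE/%s/mdout" % run)))
> >> ---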
> >>
> >> In case of any error (written in the mdout file or in the standard output,
> >> i.e. on screen or in nohup.out ...), please report it here as well.
> >>
> >> Thanks,
> >>
> >> M.
> >>
> >>
> >>
> >>
> >>
> >> On Thu, 06 Jun 2013 00:34:39 +0200, Jonathan Gough
> >> <jonathan.d.gough.gmail.com> wrote:
> >>
> >> > I know I'm late to the game, but I have been reading some of these two
> >> > Titan threads. I'm now attempting to test my one Titan card, and I want
> >> > to make sure I understand what I ought to be doing.
> >> >
> >> > Download the Amber_GPU_Benchmark_Suite
> >> > in mdin, change nstlim=100000
> >> > and then run the 6 benchmarks at least 2 times each
> >> >
> >> > yes?
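> >> >
> >> > Concretely, I was planning to drive the runs with something like this
> >> > (just a sketch on my side; the directory and file names are assumptions
> >> > about the benchmark suite layout):
> >> >
> >> > ---
> >> > # Sketch: run each benchmark twice with pmemd.cuda, keeping separate mdouts.
> >> > # Directory/file names are assumptions; adjust to the benchmark suite layout.
> >> > import os, subprocess
> >> >
> >> > benchmarks = ["JAC_production_NVE", "JAC_production_NPT",
> >> >               "FactorIX_production_NVE", "FactorIX_production_NPT",
> >> >               "Cellulose_production_NVE", "Cellulose_production_NPT"]
> >> >
> >> > for bench in benchmarks:
> >> >     for run in ("1", "2"):
> >> >         outdir = os.path.join(bench, run)
> >> >         os.makedirs(outdir)  # assumes the run directories do not exist yet
> >> >         subprocess.check_call(
> >> >             ["pmemd.cuda", "-O",
> >> >              "-i", os.path.join(bench, "mdin"),
> >> >              "-p", os.path.join(bench, "prmtop"),
> >> >              "-c", os.path.join(bench, "inpcrd"),
> >> >              "-o", os.path.join(outdir, "mdout"),
> >> >              "-r", os.path.join(outdir, "restrt")])
> >> > ---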
> >> >
> >> > The issue we have had is that simulations would just stop prematurely.
> >> > We didn't see any error messages in the mdout file, though; they just
> >> > stopped.
> >> >
> >> > We're using CUDA 5.0 and driver version 319.23.
> >> >
> >> >
> >> >
> >> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz> wrote:
> >> >
> >> >> Hi Scott,
> >> >>
> >> >> thanks for the update! Let's see what the reaction from NVIDIA will be.
> >> >> In the worst case, let's hope that some other (non-NVIDIA) "GPU FFT
> >> >> library" alternatives exist (to be compiled/used alternatively with
> >> >> pmemd.cuda).
> >> >>
> >> >> BTW, I just found this perhaps interesting article (I only list the
> >> >> supplementary part):
> >> >>
> >> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
> >> >>
> >> >> OK, meanwhile I finished my experiments/tests with swapping my two
> >> >> Titans between slots. As you can see below, it did not solve the
> >> >> problems on my "less stable" Titan, but on the other hand there is a
> >> >> significant improvement.
> >> >> I will now try with just my "less stable" GPU plugged into the
> >> >> motherboard, to confirm whether its lower stability originates in a
> >> >> higher sensitivity to the dual-GPU configuration (or just to a dual-GPU
> >> >> config with another Titan; maybe with a GTX 580/680 it will be OK, or
> >> >> at least better than with two Titans).
> >> >>
> >> >> M.
> >> >>
> >> >>
> >> >> SIMULTANEOUS TEST (BOTH GPUS) running at the same time
> >> >>
> >> >> density (100K steps, NPT, restrained solute)
> >> >> prod1 and prod2 (250K steps, NPT)
> >> >>
> >> >> TITAN_0, TITAN_1 now identify PCI slots rather than the given cards.
> >> >>
> >> >> All the errors I obtained here were just this one:
> >> >>
> >> >> -----
> >> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> >> >> -----
> >> >>
> >> >> #1 ORIGINAL CONFIGURATION
> >> >>
> >> >> density         prod1           prod2
> >> >>
> >> >> TITAN_0
> >> >> -297755.2479    -299267.1086    65K
> >> >> 20K             -299411.2631    100K
> >> >>
> >> >> TITAN_1
> >> >> -297906.5447    -298657.3725    -298683.8965
> >> >> -297906.5447    -298657.3725    -298683.8965
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> #2 AFTER GPU SWAPPING (with respect to PCI slots)
> >> >>
> >> >> density         prod1           prod2
> >> >>
> >> >> TITAN_0 (so these are results of the GPU named before as TITAN_1)
> >> >> -297906.5447    -298657.3725    -298683.8965
> >> >> -297906.5447    -298657.3725    -298683.8965
> >> >>
> >> >> TITAN_1 (so these are results of the GPU named before as TITAN_0)
> >> >> -297906.5447    240K            -298764.5294
> >> >> -297752.2836    -298997.8891    -299610.3812
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Wed, 05 Jun 2013 18:15:48 +0200, Scott Le Grand
> >> >> <varelse2005.gmail.com> wrote:
> >> >>
> >> >> > Filip,
> >> >> > What's happening on Titan can take a while to trigger. I have
> >> >> > delivered a repro to NVIDIA that shows exactly what's happening, but
> >> >> > it's up to them to explain why, because it's occurring inside cuFFT.
> >> >> > That's why you need to run at least 100K iterations to see a single
> >> >> > occurrence.
> >> >> >
> >> >> > There's a second issue that's happening with large GB simulations,
> >> >> > but that one is even harder to trap. That doesn't mean it isn't
> >> >> > happening, just that it's on the very edge of doing so on Titan.
> >> >> >
> >> >> > Thankfully, I have not been able to trigger either bug on GK104 or
> >> >> > K20...
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 06 2013 - 07:30:02 PDT