Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Jonathan Gough <jonathan.d.gough.gmail.com>
Date: Fri, 7 Jun 2013 15:48:36 -0400

Good News (maybe)

  1. The nucleosome calculations were reproducible at nstlim=100000.
  2. My new GTX 780 seems to be stable. See results below.

CentOS 6
GNU compilers
CUDA 5.0 and driver version 319.23
AmberTools version 13.09
Amber version 12.18

EVGA 06G-P4-2793-KR GeForce GTX TITAN
GB-Nucleosome
nucleosome/1/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
nucleosome/2/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
nucleosome/3/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
nucleosome/4/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936


FWIW, here is the data for the GTX 780:
EVGA 03G-P4-2781-KR GeForce GTX 780
Ran each of the tests 4x at nstlim=100000.

Not that I know there was an issue, but paranoia has set in and I felt
the need to be comprehensive. Everything looks reproducible.
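
For reference, the whole sweep can be driven by a loop along these lines
(a sketch only: the per-run directory layout, the pmemd.cuda invocation,
and the grep pattern are my assumptions, not a record of the exact
commands used):

  #!/bin/bash
  # Sketch: run each benchmark 4x and print the last Etot line of each
  # mdout; identical lines across the four runs mean reproducible results.
  # Assumes each $t/$run directory holds its own mdin (nstlim=100000),
  # prmtop, and inpcrd, and that pmemd.cuda is on $PATH.
  tests="JAC_production_NPT JAC_production_NVE
         FactorIX_production_NPT FactorIX_production_NVE
         Cellulose_production_NPT Cellulose_production_NVE"
  for t in $tests; do
      for run in 1 2 3 4; do
          ( cd "$t/$run" && pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout )
          echo "$t/$run/mdout: $(grep 'Etot' "$t/$run/mdout" | tail -n 1)"
      done
  done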

JAC_production_NPT/1/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
JAC_production_NPT/2/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
JAC_production_NPT/3/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
JAC_production_NPT/4/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675

JAC_production_NVE/1/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
JAC_production_NVE/2/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
JAC_production_NVE/3/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
JAC_production_NVE/4/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079

FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
FactorIX_production_NVE/3/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
FactorIX_production_NVE/4/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162

FactorIX_production_NPT/1/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
FactorIX_production_NPT/2/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
FactorIX_production_NPT/3/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
FactorIX_production_NPT/4/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460

Cellulose_production_NPT/1/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
Cellulose_production_NPT/2/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
Cellulose_production_NPT/3/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
Cellulose_production_NPT/4/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500

Cellulose_production_NVE/1/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
Cellulose_production_NVE/2/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
Cellulose_production_NVE/3/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
Cellulose_production_NVE/4/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644




On Thu, Jun 6, 2013 at 10:29 AM, Marek Maly <marek.maly.ujep.cz> wrote:

> OK, let us know your NUCLEOSOME results (this test will take some time
> ...).
>
> M.
>
>
>
> On Thu, 06 Jun 2013 16:37:03 +0200, Jonathan Gough
> <jonathan.d.gough.gmail.com> wrote:
>
> > I have the:
> > EVGA 06G-P4-2793-KR GeForce GTX TITAN SuperClocked Signature 6GB 384-bit
> > GDDR5 PCI Express 3.0 x16 HDCP, SLI Ready Video Card
> >
> > and the previously posted results were with bugfix 18. Checking GB
> > nucleosome now
> >
> >
> > On Thu, Jun 6, 2013 at 6:40 AM, Marek Maly <marek.maly.ujep.cz> wrote:
> >
> >> Welcome to the club :))
> >>
> >> First of all, do not panic. Scott recently identified a cuFFT "bug" in
> >> connection with Titans and reported it to NVIDIA; now we have to wait
> >> and see what the NVIDIA experts say. There is also another Amber/Titan
> >> issue with a different origin (GB simulations of big systems, i.e. the
> >> NUCLEOSOME test); you may try that one as well. The Amber developers
> >> are perhaps working on that too.
> >>
> >> So in your place I would hold off on RMAing unless you have other
> >> indications that your GPU might be damaged. In the meantime you can run
> >> some tests on the GPU with memtestG80.
> >>
> >> Here is the most recent version:
> >>
> >> ---
> >> memtestG80
> >> https://github.com/ihaque/memtestG80
> >>
> >> and here is the sync-fix commit:
> >> https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
> >> ---
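> >>
> >> (A typical invocation, as a sketch based on the project README; the
> >> 1024 MB region size and 100 iterations are just example values:
> >>
> >>   ./memtestG80 --gpu 0 1024 100
> >>
> >> i.e. test 1024 MB on GPU 0 for 100 test iterations.)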
> >>
> >> BTW, which Titan GPU are you using, the stock one or the superclocked
> >> one?
> >>
> >> Anyway, I would recommend recompiling Amber with the latest Amber 12
> >> patch (bugfix 18) if you have not done so already.
> >>
> >> M.
> >>
> >> On Thu, 06 Jun 2013 12:01:35 +0200, Jonathan Gough
> >> <jonathan.d.gough.gmail.com> wrote:
> >>
> >> > Bad News.
> >> >
> >> > I ran each set of tests 4 times at nstlim=100000. FactorIX was the
> >> > only one that gave consistent results. Again I had a few runs that
> >> > just died without any error messages.
> >> >
> >> > CentOS 6
> >> > GNU compilers
> >> > CUDA 5.0 and driver version 319.23
> >> > AmberTools version 13.09
> >> > Amber version 12.18
> >> >
> >> > Cellulose_production_NVE/1/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
> >> > Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
> >> > Cellulose_production_NVE/3/mdout: Etot = -443238.0345 EKtot = 257651.0625 EPtot = -700889.0970
> >> > Cellulose_production_NVE/4/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
> >> >
> >> > Cellulose_production_NPT/1/mdout: Etot = -441009.1612 EKtot = 257571.2031 EPtot = -698580.3643
> >> > Cellulose_production_NPT/2/mdout: Etot = -440947.3717 EKtot = 257723.3750 EPtot = -698670.7467
> >> > Cellulose_production_NPT/3/mdout: Etot = -441024.3259 EKtot = 257406.5781 EPtot = -698430.9041
> >> > Cellulose_production_NPT/4/mdout: Etot = -440970.6005 EKtot = 257756.1250 EPtot = -698726.7255
> >> >
> >> > FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> >> > FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> >> > FactorIX_production_NVE/3/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> >> > FactorIX_production_NVE/4/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> >> >
> >> > FactorIX_production_NPT/1/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> >> > FactorIX_production_NPT/2/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> >> > FactorIX_production_NPT/3/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> >> > FactorIX_production_NPT/4/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> >> >
> >> > JAC_production_NVE/1/mdout: Etot = -58141.0647 EKtot = 14347.6699 EPtot = -72488.7346
> >> > JAC_production_NVE/2/mdout: Etot = -58141.4961 EKtot = 14320.1465 EPtot = -72461.6425
> >> > JAC_production_NVE/3/mdout: Died at 48000 steps
> >> > JAC_production_NVE/4/mdout: Etot = -58141.6938 EKtot = 14257.2305 EPtot = -72398.9243
> >> >
> >> > JAC_production_NPT/1/mdout: Died at 78000 steps
> >> > JAC_production_NPT/2/mdout: Etot = -58206.6103 EKtot = 14384.7959 EPtot = -72591.4062
> >> > JAC_production_NPT/3/mdout: Etot = -58211.2469 EKtot = 14454.1592 EPtot = -72665.4061
> >> > JAC_production_NPT/4/mdout: Died at 89000 steps
> >> >
> >> >
> >> > Any recommendations on what to do? Send the card back? Update the
> >> > drivers? Update CUDA?
> >> >
> >> > On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz>
> wrote:
> >> >
> >> >> Yes, you got it.
> >> >>
> >> >> One more thing: check the benchmark mdin files carefully, and if you
> >> >> see "ig=-1" there, delete it to ensure that both runs of a given test
> >> >> use the same random seed. (As I remember, I found it in just one or
> >> >> two tests; I don't remember which.)
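> >> >>
> >> >> (For example, assuming ig=-1 and nstlim each appear in the &cntrl
> >> >> namelist of every benchmark's mdin, and that the */mdin glob matches
> >> >> your directory layout, a sketch of both edits:
> >> >>
> >> >>   # set each benchmark to 100K steps
> >> >>   sed -i 's/nstlim *= *[0-9]*/nstlim=100000/' */mdin
> >> >>   # drop the random-seed override so repeat runs use the same seed
> >> >>   sed -i 's/ig *= *-1,*//' */mdin
> >> >>
> >> >> Double-check the files afterwards, since namelist formatting varies.)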
> >> >>
> >> >> Let us know your results, i.e. whether all the tests (JAC NVE/NPT,
> >> >> FACTOR_IX NVE/NPT, etc.) successfully finished all 100K steps in both
> >> >> runs, and whether the results from both runs are identical (just
> >> >> check the final energy).
> >> >>
> >> >> In case of any error (written in the mdout file or in the standard
> >> >> output, i.e. the screen or nohup.out), please report it here as well.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> M.
> >> >>
> >> >> On Thu, 06 Jun 2013 00:34:39 +0200, Jonathan Gough
> >> >> <jonathan.d.gough.gmail.com> wrote:
> >> >>
> >> >> > I know I'm late to the game, but I have been reading some of these
> >> >> > two Titan threads. I'm now attempting to test my one Titan card, and
> >> >> > I want to make sure I understand what I ought to be doing:
> >> >> >
> >> >> > Download the Amber_GPU_Benchmark_Suite,
> >> >> > in each mdin, change nstlim=100000,
> >> >> > and then run the 6 benchmarks at least 2 times each.
> >> >> >
> >> >> > Yes?
> >> >> >
> >> >> > The issue that we have had is that simulations would just stop
> >> >> > prematurely. We didn't see any error messages in the mdout file,
> >> >> > though; they just stopped.
> >> >> >
> >> >> > We're using CUDA 5.0 and driver version 319.23.
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz>
> >> wrote:
> >> >> >
> >> >> >> Hi Scott,
> >> >> >>
> >> >> >> Thanks for the update! Let's see what the reaction from NVIDIA
> >> >> >> will be. In the worst case, let's hope that some other (non-NVIDIA)
> >> >> >> "GPU FFT library" alternatives exist, to be compiled/used with
> >> >> >> pmemd.cuda.
> >> >> >>
> >> >> >> BTW, I just found this perhaps interesting article (I only list
> >> >> >> the supplementary part):
> >> >> >>
> >> >> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
> >> >> >>
> >> >> >> OK, meanwhile I finished my experiments with swapping my two
> >> >> >> Titans between slots. As you can see below, it did not solve the
> >> >> >> problems on my "less stable" Titan, but on the other hand there is
> >> >> >> significant improvement. I will now try with just the "less stable"
> >> >> >> GPU plugged into the motherboard, to confirm whether its
> >> >> >> instability originates in a higher sensitivity to the dual-GPU
> >> >> >> configuration (or just to a dual-GPU config with another Titan;
> >> >> >> maybe with a GTX 580/680 it will be OK, or at least better than
> >> >> >> with two Titans).
> >> >> >>
> >> >> >> M.
> >> >> >>
> >> >> >>
> >> >> >> SIMULTANEOUS TEST (both GPUs running at the same time)
> >> >> >>
> >> >> >> density (100K steps, NPT, restrained solute)
> >> >> >> prod1 and prod2 (250K steps, NPT)
> >> >> >>
> >> >> >> TITAN_0 and TITAN_1 now identify PCI slots rather than particular
> >> >> >> cards. (Step counts like 65K in the tables below mark runs that
> >> >> >> died at that step.)
> >> >> >>
> >> >> >> The only error I obtained here was:
> >> >> >>
> >> >> >> -----
> >> >> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> >> >> >> -----
> >> >> >>
> >> >> >> #1 ORIGINAL CONFIGURATION
> >> >> >>
> >> >> >> density prod1 prod2
> >> >> >>
> >> >> >> TITAN_0
> >> >> >> -297755.2479 -299267.1086 65K
> >> >> >> 20K -299411.2631 100K
> >> >> >>
> >> >> >> TITAN_1
> >> >> >> -297906.5447 -298657.3725 -298683.8965
> >> >> >> -297906.5447 -298657.3725 -298683.8965
> >> >> >>
> >> >> >> #2 AFTER GPU SWAPPING (with respect to PCI slots)
> >> >> >>
> >> >> >> density prod1 prod2
> >> >> >>
> >> >> >> TITAN_0 (these are the results of the GPU previously named TITAN_1)
> >> >> >> -297906.5447 -298657.3725 -298683.8965
> >> >> >> -297906.5447 -298657.3725 -298683.8965
> >> >> >>
> >> >> >> TITAN_1 (these are the results of the GPU previously named TITAN_0)
> >> >> >> -297906.5447 240K -298764.5294
> >> >> >> -297752.2836 -298997.8891 -299610.3812
> >> >> >>
> >> >> >> On Wed, 05 Jun 2013 18:15:48 +0200, Scott Le Grand
> >> >> >> <varelse2005.gmail.com> wrote:
> >> >> >>
> >> >> >> > Filip,
> >> >> >> > What's happening on Titan can take a while to trigger. I have
> >> >> >> > delivered a repro to NVIDIA that shows exactly what's happening,
> >> >> >> > but it's up to them to explain why, because it's occurring inside
> >> >> >> > cuFFT. That's why you need to run at least 100K iterations to see
> >> >> >> > a single occurrence.
> >> >> >> >
> >> >> >> > There's a second issue that's happening with large GB
> >> >> >> > simulations, but that one is even harder to trap. That doesn't
> >> >> >> > mean it isn't happening, just that it's on the very edge of doing
> >> >> >> > so on Titan.
> >> >> >> >
> >> >> >> > Thankfully, I have not been able to trigger either bug on GK104
> >> >> >> > or K20...
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Jun 07 2013 - 13:00:02 PDT