Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Thu, 06 Jun 2013 12:40:55 +0200

Welcome to the club :))

First of all, do not panic. Scott recently identified a cuFFT "bug" in connection
with Titans and reported it to NVIDIA; now we have to wait for the NVIDIA
experts' answer. There is also another Amber/Titan issue with a different
origin (GB simulations of big systems, e.g. NUCLEOSOME), which you may try to
reproduce as well. The Amber developers are presumably working on that one too.

So in your place I would hold off on an RMA unless you have other indications
that your GPU might be damaged. In the meantime you can run some tests on this
GPU with memtestG80.

Here is the most recent version:

---
memtestG80
https://github.com/ihaque/memtestG80
and here is the sync-fix commit:
https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
---
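
If it helps, this is roughly how I build and run it. Just a sketch, assuming a
standard CUDA toolkit install; as I remember the two positional arguments are
the amount of GPU memory to test in MiB and the number of test iterations, but
check the README of your checkout for the exact options:

---
# grab the source (includes the sync fix linked above) and build it with the
# Makefile shipped for your platform -- the exact Makefile name may differ
git clone https://github.com/ihaque/memtestG80.git
cd memtestG80
make

# test ~4 GiB of the card's memory for 100 iterations
# (positional arguments: MiB to test, then iteration count)
./memtestG80 4096 100
---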
BTW, which Titan GPU are you using, the stock one or the superclocked one?
Anyway, I would recommend recompiling Amber with the latest Amber 12 patch
(bugfix 18) applied, if you have not done so already.
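
Roughly what I mean is the following (only a sketch of a GNU/CUDA rebuild; the
exact way bugfix.18 is applied depends on how you update your tree, so follow
the Amber 12 patch instructions for that step):

---
# after bugfix.18 has been applied to $AMBERHOME (see the Amber 12 update
# instructions), reconfigure and rebuild the GPU binaries from scratch
cd $AMBERHOME
./configure -cuda gnu
make clean
make install
# then rerun the test suite to make sure the new pmemd.cuda behaves
make test
---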
   M.
On Thu, 06 Jun 2013 12:01:35 +0200 Jonathan Gough
<jonathan.d.gough.gmail.com> wrote:
> Bad News.
>
> I ran each set of tests 4 times with nstlim=100000. FactorIX was the only one
> that gave consistent results. Again, I had a few runs that just died without
> any error messages.
>
> CentOS 6
> GNU compilers
> CUDA 5.0 and Driver Version: 319.23
> AmberTools version 13.09
>      Amber version 12.18
>
> Cellulose_production_NVE/1/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
> Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
> Cellulose_production_NVE/3/mdout: Etot = -443238.0345  EKtot = 257651.0625  EPtot = -700889.0970
> Cellulose_production_NVE/4/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
>
> Cellulose_production_NPT/1/mdout: Etot = -441009.1612  EKtot = 257571.2031  EPtot = -698580.3643
> Cellulose_production_NPT/2/mdout: Etot = -440947.3717  EKtot = 257723.3750  EPtot = -698670.7467
> Cellulose_production_NPT/3/mdout: Etot = -441024.3259  EKtot = 257406.5781  EPtot = -698430.9041
> Cellulose_production_NPT/4/mdout: Etot = -440970.6005  EKtot = 257756.1250  EPtot = -698726.7255
>
> FactorIX_production_NVE/1/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> FactorIX_production_NVE/2/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> FactorIX_production_NVE/3/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
> FactorIX_production_NVE/4/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
>
> FactorIX_production_NPT/1/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> FactorIX_production_NPT/2/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> FactorIX_production_NPT/3/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
> FactorIX_production_NPT/4/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
>
> JAC_production_NVE/1/mdout: Etot = -58141.0647  EKtot = 14347.6699  EPtot = -72488.7346
> JAC_production_NVE/2/mdout: Etot = -58141.4961  EKtot = 14320.1465  EPtot = -72461.6425
> JAC_production_NVE/3/mdout: Died at 48000 steps
> JAC_production_NVE/4/mdout: Etot = -58141.6938  EKtot = 14257.2305  EPtot = -72398.9243
>
> JAC_production_NPT/1/mdout: Died at 78000 steps
> JAC_production_NPT/2/mdout: Etot = -58206.6103  EKtot = 14384.7959  EPtot = -72591.4062
> JAC_production_NPT/3/mdout: Etot = -58211.2469  EKtot = 14454.1592  EPtot = -72665.4061
> JAC_production_NPT/1/mdout: Died at 89000 steps
>
>
> Any recommendations on what to do? Send the card back? Update drivers?
> Update CUDA?
>
>
>
>
> On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Yes, you got it.
>>
>> One more thing: check the benchmark mdin files carefully, and if you see
>> "ig=-1" there, delete it to ensure that both runs of a given test use the
>> same random seed.
>>
>> (As I remember, I found it in only one or two of the tests, but I don't
>> remember which.)
>>
>> Let us know your results, i.e. whether all the tests (JAC NVE/NPT, FACTOR_IX
>> NVE/NPT, etc.) successfully finished all 100K steps in both runs and whether,
>> moreover, the results of the two runs are identical (just check the final
>> energies).
>>
>> In case of any error (written in the mdout file or to standard output --
>> screen, nohup.out, ...), please report it here as well.
>>
>>    Thanks,
>>
>>        M.
>>
>>
>>
>>
>>
>> On Thu, 06 Jun 2013 00:34:39 +0200 Jonathan Gough
>> <jonathan.d.gough.gmail.com> wrote:
>>
>> > I know I'm late in the game, but I have been reading some of these two
>> > Titan threads. I'm now attempting to test my one Titan card and I want to
>> > make sure I understand what I ought to be doing.
>> >
>> > Download the Amber_GPU_Benchmark_Suite
>> > in mdin, change nstlim=100000
>> > and then run the 6 benchmarks at least 2 times each
>> >
>> > yes?
>> >
>> > The issue that we have had is that simulations would just prematurely
>> > stop. We didn't see any error messages in the mdout file, though; they
>> > just stopped.
>> >
>> > We're using CUDA 5.0 and Driver Version: 319.23
>> >
>> >
>> >
>> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >
>> >> Hi Scott,
>> >>
>> >> thanks for the update! Let's see what the reaction from NVIDIA will be.
>> >> In the worst case, let's hope that some alternative (non-NVIDIA) GPU FFT
>> >> libraries exist that could be compiled/used with pmemd.cuda instead.
>> >>
>> >> BTW, I just found this perhaps interesting article (I am only listing the
>> >> supplementary part):
>> >>
>> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
>> >>
>> >> OK, meanwhile I finished my experiments with swapping my two Titans
>> >> between slots. As you can see below, it did not solve the problems on my
>> >> "less stable" Titan, but on the other hand there is a significant
>> >> improvement. I will now try with just the "less stable" GPU plugged into
>> >> the motherboard, to confirm whether its lower stability comes from a
>> >> higher sensitivity to the dual-GPU configuration (or perhaps just to a
>> >> dual-GPU config with another Titan; maybe with a GTX 580/680 it would be
>> >> OK, or at least better than with two Titans).
>> >>
>> >>    M.
>> >>
>> >>
>> >> SIMULTANEOUS TEST (both GPUs running at the same time)
>> >>
>> >> density (100K steps, NPT, restrained solute)
>> >> prod1 and prod2 (250K steps, NPT)
>> >>
>> >> TITAN_0 and TITAN_1 now identify PCI slots rather than particular cards.
>> >>
>> >> The only error I have obtained here is:
>> >>
>> >> -----
>> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >> -----
>> >>
>> >> #1 ORIGINAL CONFIGURATION
>> >>
>> >> density           prod1            prod2
>> >>
>> >> TITAN_0
>> >> -297755.2479     -299267.1086      65K
>> >> 20K              -299411.2631     100K
>> >>
>> >> TITAN_1
>> >>   -297906.5447     -298657.3725   -298683.8965
>> >>   -297906.5447     -298657.3725   -298683.8965
>> >>
>> >>
>> >>
>> >>
>> >> #2 AFTER GPU SWAPPING (respect to PCI slots)
>> >>
>> >> density           prod1            prod2
>> >>
>> >> TITAN_0 (these are the results of the GPU previously labelled TITAN_1)
>> >>   -297906.5447   -298657.3725    -298683.8965
>> >>   -297906.5447   -298657.3725    -298683.8965
>> >>
>> >> TITAN_1 (these are the results of the GPU previously labelled TITAN_0)
>> >> -297906.5447       240K         -298764.5294
>> >> -297752.2836    -298997.8891    -299610.3812
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, 05 Jun 2013 18:15:48 +0200 Scott Le Grand <varelse2005.gmail.com>
>> >> wrote:
>> >>
>> >> > Filip,
>> >> > What's happening on Titan can take a while to trigger. I have delivered
>> >> > a repro to NVIDIA that shows exactly what's happening, but it's up to
>> >> > them to explain why, because it's occurring inside cuFFT. That's why you
>> >> > need to run at least 100K iterations to see a single occurrence.
>> >> >
>> >> > There's a second issue that's happening with large GB simulations, but
>> >> > that one is even harder to trap. That doesn't mean it isn't happening,
>> >> > just that it's on the very edge of doing so on Titan.
>> >> >
>> >> > Thankfully, I have not been able to trigger either bug on GK104 or K20...
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 06 2013 - 04:00:02 PDT