Welcome to the club :))
First of all, do not panic. Scott recently identified a cuFFT "bug" in
connection with Titans and reported it to NVIDIA; now we have to wait for
the NVIDIA experts' answer. There is also another Amber/Titan issue with a
different origin (GB runs of big systems, i.e. NUCLEOSOME), which you may
try as well. The Amber developers are perhaps also working on that one.
So in your place I would hold off on the RMA unless you have other
indications that your GPU might be damaged. In the meantime you can run
some tests on this GPU with memtestG80; here is the most recent version:
---
memtestG80
https://github.com/ihaque/memtestG80
here is the sync fix code
https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
---
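
If you want to hammer the card for a longer stretch, a minimal Python sketch
like the one below simply loops the test and keeps one log per pass. The
positional memory-size and iteration arguments are my assumption about the
usual invocation, so please check the memtestG80 README for the exact usage
of your build:

---
import datetime
import subprocess

MEMTEST = "./memtestG80"   # path to the compiled binary (adjust as needed)
MIB = "1024"               # amount of GPU RAM to test, in MiB (assumption)
ITERATIONS = "50"          # test iterations per invocation (assumption)

# Run the tester repeatedly and keep one log file per pass, so that an
# intermittent failure during an overnight run is not lost.
for i in range(20):
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    with open("memtestG80_%s.log" % stamp, "w") as log:
        result = subprocess.run([MEMTEST, MIB, ITERATIONS],
                                stdout=log, stderr=subprocess.STDOUT)
    print("pass %d finished with exit code %d" % (i, result.returncode))
---
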
BTW, which Titan GPU are you using, the stock one or the superclocked one?
Anyway, I would recommend recompiling Amber with the latest Amber 12 patch
(bugfix 18) applied, if you have not already done so.
M.
On Thu, 06 Jun 2013 12:01:35 +0200 Jonathan Gough
<jonathan.d.gough.gmail.com> wrote:
> Bad News.
>
> I ran each set of tests 4 times with nstlim=100000. FactorIX was the only one
> that gave consistent results. Again, I had a few runs that just died without
> any error messages.
>
> CentOS 6
> GNU compilers
> CUDA 5.0 and Driver Version: 319.23
> AmberTools version 13.09
> Amber version 12.18
>
> Cellulose_production_NVE/1/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
> Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
> Cellulose_production_NVE/3/mdout: Etot = -443238.0345 EKtot = 257651.0625 EPtot = -700889.0970
> Cellulose_production_NVE/4/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
>
> Cellulose_production_NPT/1/mdout: Etot = -441009.1612 EKtot = 257571.2031 EPtot = -698580.3643
> Cellulose_production_NPT/2/mdout: Etot = -440947.3717 EKtot = 257723.3750 EPtot = -698670.7467
> Cellulose_production_NPT/3/mdout: Etot = -441024.3259 EKtot = 257406.5781 EPtot = -698430.9041
> Cellulose_production_NPT/4/mdout: Etot = -440970.6005 EKtot = 257756.1250 EPtot = -698726.7255
>
> FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> FactorIX_production_NVE/3/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
> FactorIX_production_NVE/4/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>
> FactorIX_production_NPT/1/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> FactorIX_production_NPT/2/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> FactorIX_production_NPT/3/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
> FactorIX_production_NPT/4/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>
> JAC_production_NVE/1/mdout: Etot = -58141.0647 EKtot = 14347.6699 EPtot = -72488.7346
> JAC_production_NVE/2/mdout: Etot = -58141.4961 EKtot = 14320.1465 EPtot = -72461.6425
> JAC_production_NVE/3/mdout: Died at 48000 steps
> JAC_production_NVE/4/mdout: Etot = -58141.6938 EKtot = 14257.2305 EPtot = -72398.9243
>
> JAC_production_NPT/1/mdout: Died at 78000 steps
> JAC_production_NPT/2/mdout: Etot = -58206.6103 EKtot = 14384.7959 EPtot = -72591.4062
> JAC_production_NPT/3/mdout: Etot = -58211.2469 EKtot = 14454.1592 EPtot = -72665.4061
> JAC_production_NPT/4/mdout: Died at 89000 steps
>
>
> Any recommendations on what to do? Send the card back? Update drivers?
> Update CUDA?
>
>
>
>
> On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Yes, you got it.
>>
>> One more thing: check the benchmark mdin files carefully, and if you see
>> "ig=-1" there, delete it to ensure that both runs of a given test use the
>> same random seed.
>>
>> (As I remember, I found it in only one or two tests; I don't recall which
>> ones.)
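>>
>> If you have many mdin files to edit, a small Python sketch along these
>> lines can strip the seed setting for you (the glob pattern is only my
>> guess at where the benchmark mdin files sit - adjust it, or simply edit
>> the files by hand):
>>
>> ---
>> import glob
>> import re
>>
>> # Remove "ig=-1" from every benchmark mdin so that repeated runs fall back
>> # to Amber's default (fixed) random seed and stay directly comparable.
>> for path in glob.glob("*_production_*/mdin"):
>>     with open(path) as f:
>>         text = f.read()
>>     cleaned = re.sub(r"\big\s*=\s*-1\s*,?", "", text)
>>     if cleaned != text:
>>         with open(path, "w") as f:
>>             f.write(cleaned)
>>         print("removed ig=-1 from", path)
>> ---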
>>
>> Let us know your results, i.e. whether all the tests (JAC NVE/NPT,
>> FACTOR_IX NVE/NPT, etc.) successfully finished all 100K steps (in both
>> runs) and, moreover, whether the results from both runs are identical
>> (just check the final energy).
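>>
>> To compare the final energy of two runs quickly, a small Python sketch like
>> the one below works; it simply takes the last "Etot =" line printed in each
>> mdout (the script name in the example invocation is arbitrary):
>>
>> ---
>> import re
>> import sys
>>
>> # Print the last "Etot =" value found in each of two mdout files and say
>> # whether they match exactly, e.g.:
>> #   python compare_etot.py JAC_production_NVE/1/mdout JAC_production_NVE/2/mdout
>> def last_etot(path):
>>     value = None
>>     with open(path) as f:
>>         for line in f:
>>             m = re.search(r"Etot\s*=\s*(-?\d+\.\d+)", line)
>>             if m:
>>                 value = float(m.group(1))
>>     return value
>>
>> run_a, run_b = sys.argv[1], sys.argv[2]
>> energy_a, energy_b = last_etot(run_a), last_etot(run_b)
>> print(run_a, energy_a)
>> print(run_b, energy_b)
>> print("identical" if energy_a == energy_b else "DIFFERENT")
>> ---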
>>
>> In case of any error (written in the mdout file or in the standard output
>> (screen or nohup.out ...)), please report it here as well.
>>
>> Thanks,
>>
>> M.
>>
>> On Thu, 06 Jun 2013 00:34:39 +0200 Jonathan Gough
>> <jonathan.d.gough.gmail.com> wrote:
>>
>> > I know I'm late to the game, but I have been reading some of these two
>> > Titan threads. I'm now attempting to test my one Titan card, and I want to
>> > make sure I understand what I ought to be doing.
>> >
>> > Download the Amber_GPU_Benchmark_Suite,
>> > in each mdin change nstlim=100000,
>> > and then run the 6 benchmarks at least 2 times each (I plan to script the
>> > repeats roughly as in the sketch below).
>> >
>> > yes?
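>> >
>> > The prmtop/inpcrd file names in this sketch are just my guess at what the
>> > suite ships; I'll adjust them to whatever is actually in each directory:
>> >
>> > ---
>> > import os
>> > import subprocess
>> >
>> > # Run each benchmark twice, in subdirectories 1/ and 2/, so the pairs of
>> > # mdout files can be compared afterwards.
>> > BENCHMARKS = ["JAC_production_NVE", "JAC_production_NPT",
>> >               "FactorIX_production_NVE", "FactorIX_production_NPT",
>> >               "Cellulose_production_NVE", "Cellulose_production_NPT"]
>> >
>> > for bench in BENCHMARKS:
>> >     for run in ("1", "2"):
>> >         rundir = os.path.join(bench, run)
>> >         os.makedirs(rundir, exist_ok=True)
>> >         # check=False: a run that dies prematurely should not stop the loop.
>> >         subprocess.run(["pmemd.cuda", "-O",
>> >                         "-i", "../mdin", "-p", "../prmtop",
>> >                         "-c", "../inpcrd", "-o", "mdout",
>> >                         "-r", "restrt", "-x", "mdcrd"],
>> >                        cwd=rundir, check=False)
>> > ---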
>> >
>> > The issue we have had is that simulations would just stop prematurely.
>> > We didn't see any error messages in the mdout file, though; they just
>> > stopped.
>> >
>> > We're using CUDA 5.0 and Driver Version: 319.23
>> >
>> >
>> >
>> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >
>> >> Hi Scott,
>> >>
>> >> thanks for the update! Let's see what the reaction from NVIDIA will be.
>> >> In the worst case, let's hope that some other (non-NVIDIA) "GPU FFT
>> >> library" alternatives exist (to be compiled/used alternatively with
>> >> pmemd.cuda).
>> >>
>> >> BTW, I just found this perhaps interesting article (I only list the
>> >> supplementary part):
>> >>
>> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
>> >>
>> >> OK, meanwhile I finished my experiment/tests with swapping my two Titans
>> >> between the slots. As you can see below, it did not solve the problems on
>> >> my "less stable" Titan, but on the other hand there is a significant
>> >> improvement. I will now try with just my "less stable" GPU plugged into
>> >> the motherboard, to confirm that its lower stability comes from a higher
>> >> sensitivity to the dual-GPU configuration (or just to a dual-GPU config
>> >> with another Titan; maybe with a GTX 580/680 it will be OK, or at least
>> >> better than with two Titans).
>> >>
>> >> M.
>> >>
>> >>
>> >> SIMULTANEOUS TEST (both GPUs running at the same time)
>> >>
>> >> density (100K steps, NPT, restrained solute)
>> >> prod1 and prod2 (250K steps, NPT)
>> >>
>> >> TITAN_0 and TITAN_1 now identify PCI slots rather than particular cards;
>> >> entries given as a step count (e.g. 65K) are runs that died at that step.
>> >>
>> >> All the errors I obtained here were just:
>> >>
>> >> -----
>> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >> -----
>> >>
>> >> #1 ORIGINAL CONFIGURATION
>> >>
>> >> density prod1 prod2
>> >>
>> >> TITAN_0
>> >> -297755.2479 -299267.1086 65K
>> >> 20K -299411.2631 100K
>> >>
>> >> TITAN_1
>> >> -297906.5447 -298657.3725 -298683.8965
>> >> -297906.5447 -298657.3725 -298683.8965
>> >>
>> >>
>> >>
>> >>
>> >> #2 AFTER GPU SWAPPING (with respect to PCI slots)
>> >>
>> >> density prod1 prod2
>> >>
>> >> TITAN_0 (these are the results of the GPU previously labeled TITAN_1)
>> >> -297906.5447 -298657.3725 -298683.8965
>> >> -297906.5447 -298657.3725 -298683.8965
>> >>
>> >> TITAN_1 (these are the results of the GPU previously labeled TITAN_0)
>> >> -297906.5447 240K -298764.5294
>> >> -297752.2836 -298997.8891 -299610.3812
>> >>
>> >> On Wed, 05 Jun 2013 18:15:48 +0200 Scott Le Grand <varelse2005.gmail.com>
>> >> wrote:
>> >>
>> >> > Filip,
>> >> > What's happening on Titan can take a while to trigger. I have delivered
>> >> > a repro to NVIDIA that shows exactly what's happening, but it's up to
>> >> > them to explain why, because it's occurring inside cuFFT. That's why you
>> >> > need to run at least 100K iterations to see a single occurrence.
>> >> >
>> >> > There's a second issue that's happening with large GB simulations, but
>> >> > that one is even harder to trap. That doesn't mean it isn't happening,
>> >> > just that it's on the very edge of doing so on Titan.
>> >> >
>> >> > Thankfully, I have not been able to trigger either bug on GK104 or
>> >> > K20...
--
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 06 2013 - 04:00:02 PDT