OK, thanks Scott!
Meanwhile I did my test with the "less stable" Titan in a solo configuration,
and it did not significantly improve the situation...
M.
On Thu, 06 Jun 2013 16:23:30 +0200 Scott Le Grand <varelse2005.gmail.com>
wrote:
> My *suspicion* is that the GB issue and the FFT issue have the same root
> cause. The difference is that the GB issue happens in my own code, so I'm
> continuing to investigate, but since the FFT code is NVIDIA's, all I could
> do was create a repro app for them and hand it off. We'll find a fix
> and/or a workaround; we always do...
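>
> For the curious, the general shape of such a repro app is nothing fancier
> than hammering the suspect transform in a loop and checking for errors.
> A rough sketch only (placeholder grid size and plain error checks, not the
> actual code I handed NVIDIA):
>
> // cufft_stress_sketch.cu - rough sketch of a cuFFT stress loop.
> // Grid size and iteration count are placeholders, not repro parameters.
> #include <cstdio>
> #include <cuda_runtime.h>
> #include <cufft.h>
>
> int main() {
>     const int nx = 64, ny = 64, nz = 64;  // placeholder PME-like grid
>     cufftReal    *in;
>     cufftComplex *out;
>     cudaMalloc(&in,  sizeof(cufftReal)    * nx * ny * nz);
>     cudaMalloc(&out, sizeof(cufftComplex) * nx * ny * (nz / 2 + 1));
>     cudaMemset(in, 0, sizeof(cufftReal)   * nx * ny * nz);
>
>     cufftHandle plan;
>     cufftPlan3d(&plan, nx, ny, nz, CUFFT_R2C);
>
>     // Intermittent faults can need ~100K iterations to show up even once.
>     for (long i = 0; i < 100000; i++) {
>         if (cufftExecR2C(plan, in, out) != CUFFT_SUCCESS) {
>             printf("cufftExecR2C failed at iteration %ld\n", i);
>             break;
>         }
>         cudaError_t err = cudaDeviceSynchronize();
>         if (err != cudaSuccess) {
>             printf("CUDA error at iteration %ld: %s\n",
>                    i, cudaGetErrorString(err));
>             break;
>         }
>     }
>     cufftDestroy(plan);
>     cudaFree(in);
>     cudaFree(out);
>     return 0;
> }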
>
> Scott
>
>
>
>
> On Thu, Jun 6, 2013 at 3:40 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Welcome to the club :))
>>
>> First of all, do not panic. Scott recently identified some cuFFT "bug" in
>> connection with Titans and reported it to NVIDIA; now we have to wait and
>> see what the NVIDIA experts answer. There is also another Amber/Titan
>> issue with a different origin (GB runs of big systems, e.g. the
>> NUCLEOSOME benchmark), which you may try as well. The Amber developers
>> are perhaps working on that one too.
>>
>> So in your place I would hold off on an RMA unless you have other
>> indications that your GPU might be damaged. In the meantime you can run
>> some tests of this GPU with memtestG80 (see the build/run sketch after
>> the links below).
>>
>> Here is the most recent version:
>>
>> ---
>> memtestG80
>> https://github.com/ihaque/memtestG80
>>
>> and here is the sync-fix commit:
>> https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
>> ---
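>>
>> Building and running it goes roughly like this (I am writing the flags
>> from memory, so check the README / --help output for the exact usage):
>>
>>    git clone https://github.com/ihaque/memtestG80.git
>>    cd memtestG80
>>    make   # or use the platform-specific Makefile shipped in the repo
>>    ./memtestG80 --gpu 0 1024 50   # test 1024 MiB on GPU 0, 50 iterations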
>>
>> BTW, which Titan GPU are you using: the stock one or the superclocked
>> one?
>>
>> Anyway, I would recommend recompiling Amber with the latest Amber 12
>> patch (bugfix 18) applied, if you have not done so already.
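>>
>> After patching, the rebuild is roughly the following (assuming a
>> standard GNU-compiler CUDA build; adjust to your setup):
>>
>>    cd $AMBERHOME
>>    make clean
>>    ./configure -cuda gnu
>>    make install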
>>
>> M.
>>
>> On Thu, 06 Jun 2013 12:01:35 +0200 Jonathan Gough
>> <jonathan.d.gough.gmail.com> wrote:
>>
>> > Bad news.
>> >
>> > I ran each set of tests 4 times with nstlim=100000. FactorIX was the
>> > only one that gave consistent results. Again, I had a few runs that
>> > just died without any error messages.
>> >
>> > CentOS 6
>> > GNU compilers
>> > CUDA 5.0 and driver version 319.23
>> > AmberTools version 13.09
>> > Amber version 12.18
>> >
>> > Cellulose_production_NVE/1/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
>> > Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
>> > Cellulose_production_NVE/3/mdout: Etot = -443238.0345  EKtot = 257651.0625  EPtot = -700889.0970
>> > Cellulose_production_NVE/4/mdout: Etot = -443246.3206  EKtot = 258074.3438  EPtot = -701320.6644
>> >
>> > Cellulose_production_NPT/1/mdout: Etot = -441009.1612  EKtot = 257571.2031  EPtot = -698580.3643
>> > Cellulose_production_NPT/2/mdout: Etot = -440947.3717  EKtot = 257723.3750  EPtot = -698670.7467
>> > Cellulose_production_NPT/3/mdout: Etot = -441024.3259  EKtot = 257406.5781  EPtot = -698430.9041
>> > Cellulose_production_NPT/4/mdout: Etot = -440970.6005  EKtot = 257756.1250  EPtot = -698726.7255
>> >
>> > FactorIX_production_NVE/1/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
>> > FactorIX_production_NVE/2/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
>> > FactorIX_production_NVE/3/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
>> > FactorIX_production_NVE/4/mdout: Etot = -234189.5802  EKtot = 54845.8359  EPtot = -289035.4162
>> >
>> > FactorIX_production_NPT/1/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
>> > FactorIX_production_NPT/2/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
>> > FactorIX_production_NPT/3/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
>> > FactorIX_production_NPT/4/mdout: Etot = -234493.4304  EKtot = 55062.0156  EPtot = -289555.4460
>> >
>> > JAC_production_NVE/1/mdout: Etot = -58141.0647  EKtot = 14347.6699  EPtot = -72488.7346
>> > JAC_production_NVE/2/mdout: Etot = -58141.4961  EKtot = 14320.1465  EPtot = -72461.6425
>> > JAC_production_NVE/3/mdout: Died at 48000 steps
>> > JAC_production_NVE/4/mdout: Etot = -58141.6938  EKtot = 14257.2305  EPtot = -72398.9243
>> >
>> > JAC_production_NPT/1/mdout: Died at 78000 steps
>> > JAC_production_NPT/2/mdout: Etot = -58206.6103  EKtot = 14384.7959  EPtot = -72591.4062
>> > JAC_production_NPT/3/mdout: Etot = -58211.2469  EKtot = 14454.1592  EPtot = -72665.4061
>> > JAC_production_NPT/4/mdout: Died at 89000 steps
>> >
>> >
>> > Any recommendations on what to do? Send the card back? Update the
>> > driver? Update CUDA?
>> >
>> >
>> >
>> >
>> > On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >
>> >> Yes, you got it.
>> >>
>> >> One more thing: check the benchmark mdin files carefully, and if you
>> >> see "ig=-1" there, delete it, to ensure that both runs of a given test
>> >> use the same random seed (ig=-1 makes the code draw the seed from the
>> >> clock, so every run differs). As I remember I found it in just one or
>> >> two of the tests; I don't remember which. A sketch of the relevant
>> >> mdin part is below.
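>> >>
>> >> For illustration only (a hypothetical benchmark-style &cntrl block;
>> >> your actual mdin files will differ in the details):
>> >>
>> >>  &cntrl
>> >>    ntx=5, irest=1,            ! restart from equilibrated coordinates
>> >>    nstlim=100000, dt=0.002,   ! the 100K test steps
>> >>    ntc=2, ntf=2, cut=8.0,
>> >>    ntpr=1000,
>> >>    ig=-1,                     ! <- delete this line; ig=-1 reseeds
>> >>                               !    from the clock on every run
>> >>  /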
>> >>
>> >> Let us know your results, i.e. whether all the tests (JAC NVE/NPT,
>> >> FACTOR_IX NVE/NPT, etc.) successfully finished all 100K steps in both
>> >> runs, and moreover whether the results from both runs are identical
>> >> (just check the final energies; see the one-liner below).
>> >>
>> >> In case of any error (written in the mdout file or in the standard
>> >> output, i.e. on screen or in nohup.out), please report it here as
>> >> well.
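>> >>
>> >> A quick way to pull out the last reported energies for comparison
>> >> (the mdout files print "Etot =" lines, so something like this should
>> >> do):
>> >>
>> >>    grep "Etot" mdout | tail -1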
>> >>
>> >> Thanks,
>> >>
>> >> M.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, 06 Jun 2013 00:34:39 +0200 Jonathan Gough
>> >> <jonathan.d.gough.gmail.com> wrote:
>> >>
>> >> > I know I'm late to the game, but I have been reading these two
>> >> > Titan threads. I'm now attempting to test my one Titan card, and I
>> >> > want to make sure I understand what I ought to be doing:
>> >> >
>> >> > download the Amber_GPU_Benchmark_Suite,
>> >> > change nstlim=100000 in each mdin,
>> >> > and then run the 6 benchmarks at least 2 times each
>> >> > (concretely, something like the sketch below).
>> >> >
>> >> > Yes?
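>> >> >
>> >> > Per benchmark directory, I am planning roughly the following (file
>> >> > names guessed from the usual Amber layout; the suite's actual names
>> >> > may differ):
>> >> >
>> >> >    for run in 1 2; do
>> >> >      mkdir -p $run
>> >> >      (cd $run && pmemd.cuda -O -i ../mdin -p ../prmtop -c ../inpcrd \
>> >> >                             -o mdout -r restrt -x mdcrd)
>> >> >    done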
>> >> >
>> >> > The issue we have had is that simulations would just stop
>> >> > prematurely. We didn't see any error messages in the mdout files,
>> >> > though; they just stopped.
>> >> >
>> >> > We're using CUDA 5.0 and driver version 319.23.
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz>
>> wrote:
>> >> >
>> >> >> Hi Scott,
>> >> >>
>> >> >> thanks for the update! Let's see what the reaction from NVIDIA will
>> >> >> be. In the worst case, let's hope that some other (non-NVIDIA) GPU
>> >> >> FFT library alternatives exist that could be compiled/used with
>> >> >> pmemd.cuda instead.
>> >> >>
>> >> >> BTW, I just found this perhaps interesting article (I am linking
>> >> >> only the supplementary part):
>> >> >>
>> >> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
>> >> >>
>> >> >> OK, meanwhile I finished my experiments with swapping my two Titans
>> >> >> between the PCI slots. As you can see below, it did not solve the
>> >> >> problems on my "less stable" Titan, but on the other hand there is
>> >> >> a significant improvement. I will now try with just the "less
>> >> >> stable" GPU plugged into the motherboard, to confirm whether its
>> >> >> lower stability originates in a higher sensitivity to the dual-GPU
>> >> >> configuration (or perhaps just to a dual-GPU config with another
>> >> >> Titan; maybe with a GTX 580/680 it will be OK, or at least better
>> >> >> than with two Titans).
>> >> >>
>> >> >> M.
>> >> >>
>> >> >>
>> >> >> SIMULTANEOUS TEST (both GPUs running at the same time)
>> >> >>
>> >> >> density (100K steps, NPT, restrained solute)
>> >> >> prod1 and prod2 (250K steps, NPT)
>> >> >>
>> >> >> TITAN_0 and TITAN_1 now identify PCI slots rather than the given
>> >> >> cards. In the tables below, a step count such as 20K or 65K marks
>> >> >> the step at which that run died.
>> >> >>
>> >> >> All the errors I obtained here were just:
>> >> >>
>> >> >> -----
>> >> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >> >> -----
>> >> >>
>> >> >> #1 ORIGINAL CONFIGURATION
>> >> >>
>> >> >>            density        prod1          prod2
>> >> >> TITAN_0
>> >> >>            -297755.2479   -299267.1086   65K
>> >> >>            20K            -299411.2631   100K
>> >> >>
>> >> >> TITAN_1
>> >> >>            -297906.5447   -298657.3725   -298683.8965
>> >> >>            -297906.5447   -298657.3725   -298683.8965
>> >> >>
>> >> >>
>> >> >> #2 AFTER GPU SWAPPING (with respect to PCI slots)
>> >> >>
>> >> >>            density        prod1          prod2
>> >> >> TITAN_0 (i.e. the GPU previously named TITAN_1)
>> >> >>            -297906.5447   -298657.3725   -298683.8965
>> >> >>            -297906.5447   -298657.3725   -298683.8965
>> >> >>
>> >> >> TITAN_1 (i.e. the GPU previously named TITAN_0)
>> >> >>            -297906.5447   240K           -298764.5294
>> >> >>            -297752.2836   -298997.8891   -299610.3812
>> >> >>
>> >> >> On Wed, 05 Jun 2013 18:15:48 +0200 Scott Le Grand
>> >> >> <varelse2005.gmail.com> wrote:
>> >> >>
>> >> >> > Filip,
>> >> >> > What's happening on Titan can take a while to trigger. I have
>> >> >> > delivered a repro to NVIDIA that shows exactly what's happening,
>> >> >> > but it's up to them to explain why, because it's occurring inside
>> >> >> > cuFFT. That's why you need to run at least 100K iterations to see
>> >> >> > a single occurrence.
>> >> >> >
>> >> >> > There's a second issue that's happening with large GB
>> >> >> > simulations, but that one is even harder to trap. That doesn't
>> >> >> > mean it isn't happening, just that it's on the very edge of doing
>> >> >> > so on Titan.
>> >> >> >
>> >> >> > Thankfully, I have not been able to trigger either bug on GK104
>> >> >> > or K20...
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 06 2013 - 08:00:03 PDT