Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Wed, 12 Jun 2013 20:16:46 +0200

Hi Scott,
thanks for the update !

Regarding your own findings, please don't forget to copy the NVIDIA guys
on your messages as well; that may speed up their debugging work :))


     M.






On Wed, 12 Jun 2013 19:07:49 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> So using the read-only cache doesn't fix the problem. It's interesting in
> that the compiler doesn't always handle doing so properly. Theoretically,
> declaring a pointer to be const __restrict__ should activate it. Well,
> sort of, sometimes...
>
> That said, using the "__ldg" intrinsic to load data works around the
> problem. It looks like this: instead of x = *p, use x = __ldg(p)...
>
> NVIDIA is taking a while to figure this out because the repro case runs for
> 3-10 hours before it occurs. The issues I already fixed take 10-60
> minutes. Wish them luck! This is the sort of bug that drives an engineer
> batty...
>
> Scott
>
>
>
>
> On Tue, Jun 11, 2013 at 4:38 AM, filip fratev <filipfratev.yahoo.com>
> wrote:
>
>> Hi Scott,
>> Thanks a lot also from me for your update!
>>
>> Regards,
>> Filip
>>
>>
>> ________________________________
>> From: Marek Maly <marek.maly.ujep.cz>
>> To: AMBER Mailing List <amber.ambermd.org>
>> Sent: Tuesday, June 11, 2013 1:23 PM
>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> memtestG80 - UNDERclocking in Linux ?
>>
>>
>> Hi Scott,
>> thanks for the update!
>>
>> It's a good starting point that the NVIDIA guys were able
>> to reproduce the "cuFFT" errors on the Titan GPU.
>>
>> Thanks also for your personal effort and let's hope
>> that this issue will be resolved soon.
>>
>> M.
>>
>>
>>
>>
>> On Tue, 11 Jun 2013 06:31:37 +0200, Scott Le Grand
>> <varelse2005.gmail.com> wrote:
>>
>> > So the issue is now reproed at NVIDIA and I'm playing with a GK110 feature
>> > called the read-only data cache as an alternative to using the texture unit
>> > (the apparent root cause). It's a slightly different path through the hw.
>> > I doubt it will change anything, but it's worth a shot.
>> >
>> >
>> >
>> >
>> > On Sun, Jun 9, 2013 at 8:22 AM, ET <sketchfoot.gmail.com> wrote:
>> >
>> >> Nice one Scott! Thanks for sorting this out! :)
>> >>
>> >>
>> >> On 7 June 2013 23:50, Scott Le Grand <varelse2005.gmail.com> wrote:
>> >>
>> >> > All sorts of possible explanations: better binning, different ASICs,
>> >> > different process, dumb luck, etc...
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jun 7, 2013 at 3:41 PM, filip fratev <filipfratev.yahoo.com>
>> >> > wrote:
>> >> >
>> >> > > I am curious why the GTX 780 works but the Titan does not, i.e. what
>> >> > > might be the specific reason for the cuFFT/Titan problem?
>> >> > >
>> >> > > Regards,
>> >> > > F.
>> >> > >
>> >> > >
>> >> > > ________________________________
>> >> > > From: Scott Le Grand <varelse2005.gmail.com>
>> >> > > To: AMBER Mailing List <amber.ambermd.org>
>> >> > > Sent: Saturday, June 8, 2013 1:05 AM
>> >> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> >> > > memtestG80 - UNDERclocking in Linux ?
>> >> > >
>> >> > >
>> >> > > Jonathan: Oh ye of little faith...
>> >> > >
>> >> > > They just got the thing running at their end, give 'em some time. CuFFT
>> >> > > is mission critical to CUDA - they'll fix it...
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Fri, Jun 7, 2013 at 2:40 PM, Jonathan Gough
>> >> > > <jonathan.d.gough.gmail.com> wrote:
>> >> > >
>> >> > > > I wonder if I could trade in my Titan for a GTX 780...
>> >> > > >
>> >> > > >
>> >> > > > On Fri, Jun 7, 2013 at 5:16 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >> > > >
>> >> > > > > Thanks, Scott, for the good news!
>> >> > > > >
>> >> > > > > Let's hope that the guys from NVIDIA resolve
>> >> > > > > the cuFFT/TITAN problem before the new
>> >> > > > > chip architecture is released :))
>> >> > > > >
>> >> > > > > M.
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > On Fri, 07 Jun 2013 22:45:50 +0200, Scott Le Grand
>> >> > > > > <varelse2005.gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Really really interesting...
>> >> > > > > >
>> >> > > > > > I seem to have found a fix for the GB issues on my Titan - not so
>> >> > > > > > surprisingly, it's the same fix as on GTX4xx/GTX5xx...
>> >> > > > > >
>> >> > > > > > But this doesn't yet explain the weirdness with cuFFT so we're not
>> >> > > > > > done here yet...
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Fri, Jun 7, 2013 at 12:48 PM, Jonathan Gough
>> >> > > > > > <jonathan.d.gough.gmail.com> wrote:
>> >> > > > > >
>> >> > > > > >> Good News (maybe)
>> >> > > > > >>
>> >> > > > > >> 1. The nucleosome calculations were reproducible at nstlim=100000
>> >> > > > > >> 2. My new GTX 780 seems to be stable. See results below.
>> >> > > > > >>
>> >> > > > > >> CentOS 6
>> >> > > > > >> GNU compilers
>> >> > > > > >> CUDA 5.0 and Driver Version: 319.23
>> >> > > > > >> AmberTools version 13.09
>> >> > > > > >> Amber version 12.18
>> >> > > > > >>
>> >> > > > > >> EVGA 06G-P4-2793-KR GeForce GTX TITAN
>> >> > > > > >> GB-Nucleosome
>> >> > > > > >> nucleosome/1/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
>> >> > > > > >> nucleosome/2/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
>> >> > > > > >> nucleosome/3/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
>> >> > > > > >> nucleosome/4/mdout: Etot = -66858.7444 EKtot = 19709.4492 EPtot = -86568.1936
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> FWIW: Here is data for GTX 780
>> >> > > > > >> EVGA 03G-P4-2781-KR GeForce GTX 780
>> >> > > > > >> Ran each of the tests at nstlim=100000 4x
>> >> > > > > >>
>> >> > > > > >> Not that I know if there was an issue, but paranoia has set in, and I
>> >> > > > > >> felt the need to be comprehensive.
>> >> > > > > >> Everything is looking reproducible.
>> >> > > > > >>
>> >> > > > > >> JAC_production_NPT/1/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
>> >> > > > > >> JAC_production_NPT/2/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
>> >> > > > > >> JAC_production_NPT/3/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
>> >> > > > > >> JAC_production_NPT/4/mdout: Etot = -58221.1921 EKtot = 14415.7754 EPtot = -72636.9675
>> >> > > > > >>
>> >> > > > > >> JAC_production_NVE/1/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
>> >> > > > > >> JAC_production_NVE/2/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
>> >> > > > > >> JAC_production_NVE/3/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
>> >> > > > > >> JAC_production_NVE/4/mdout: Etot = -58139.8773 EKtot = 14266.4307 EPtot = -72406.3079
>> >> > > > > >>
>> >> > > > > >> FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> FactorIX_production_NVE/3/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> FactorIX_production_NVE/4/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >>
>> >> > > > > >> FactorIX_production_NPT/1/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> FactorIX_production_NPT/2/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> FactorIX_production_NPT/3/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> FactorIX_production_NPT/4/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> Cellulose_production_NPT/1/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
>> >> > > > > >> Cellulose_production_NPT/2/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
>> >> > > > > >> Cellulose_production_NPT/3/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
>> >> > > > > >> Cellulose_production_NPT/4/mdout: Etot = -441074.6000 EKtot = 258388.7500 EPtot = -699463.3500
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> Cellulose_production_NVE/1/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
>> >> > > > > >> Cellulose_production_NVE/2/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
>> >> > > > > >> Cellulose_production_NVE/3/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
>> >> > > > > >> Cellulose_production_NVE/4/mdout: Etot = -443246.3519 EKtot = 258074.3125 EPtot = -701320.6644
>> >> > > > > >>
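Listings like the ones quoted above can also be compared mechanically rather than by eye. The following is a minimal Python sketch, not part of Amber or its benchmark suite. It assumes lines in the "benchmark/run/mdout: Etot = ... EKtot = ... EPtot = ..." form posted in this thread; the names LINE, check_reproducibility and sample are made up for illustration. The sample values are copied from results posted in this thread.

```python
import re
from collections import defaultdict

# Assumed line format, as posted in this thread:
#   <benchmark>/<run>/mdout: Etot = <x> EKtot = <y> EPtot = <z>
LINE = re.compile(
    r"^(?P<bench>[^/\s]+)/(?P<run>\d+)/mdout:\s*"
    r"Etot\s*=\s*(?P<etot>-?\d+\.\d+)\s+"
    r"EKtot\s*=\s*(?P<ektot>-?\d+\.\d+)\s+"
    r"EPtot\s*=\s*(?P<eptot>-?\d+\.\d+)"
)


def check_reproducibility(lines):
    """Map each benchmark name to True if all of its runs report identical energies."""
    energies = defaultdict(set)
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            # Compare the printed strings, not floats: identical runs
            # should agree digit for digit.
            energies[m["bench"]].add((m["etot"], m["ektot"], m["eptot"]))
    return {bench: len(values) == 1 for bench, values in energies.items()}


sample = [
    "FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162",
    "FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162",
    "JAC_production_NVE/1/mdout: Etot = -58141.0647 EKtot = 14347.6699 EPtot = -72488.7346",
    "JAC_production_NVE/2/mdout: Etot = -58141.4961 EKtot = 14320.1465 EPtot = -72461.6425",
]

print(check_reproducibility(sample))
# {'FactorIX_production_NVE': True, 'JAC_production_NVE': False}
```

Runs that died early produce no energy line in this form and are simply skipped by this sketch, so a crashed run still has to be spotted separately.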
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> On Thu, Jun 6, 2013 at 10:29 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >> > > > > >>
>> >> > > > > >> > OK, let us know your NUCLEOSOME results (this test will take some
>> >> > > > > >> > time ...).
>> >> > > > > >> >
>> >> > > > > >> > M.
>> >> > > > > >> >
>> >> > > > > >> >
>> >> > > > > >> >
>> >> > > > > >> > On Thu, 06 Jun 2013 16:37:03 +0200, Jonathan Gough
>> >> > > > > >> > <jonathan.d.gough.gmail.com> wrote:
>> >> > > > > >> >
>> >> > > > > >> > > I have the:
>> >> > > > > >> > > EVGA 06G-P4-2793-KR GeForce GTX TITAN SuperClocked Signature 6GB 384-bit
>> >> > > > > >> > > GDDR5 PCI Express 3.0 x16 HDCP, SLI Ready Video Card
>> >> > > > > >> > >
>> >> > > > > >> > > and the previously posted results were with bugfix 18. Checking GB
>> >> > > > > >> > > nucleosome now
>> >> > > > > >> > >
>> >> > > > > >> > >
>> >> > > > > >> > > On Thu, Jun 6, 2013 at 6:40 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >> > > > > >> > >
>> >> > > > > >> > >> Welcome to the club :))
>> >> > > > > >> > >>
>> >> > > > > >> > >> First of all, do not panic. Scott recently identified a cuFFT "bug"
>> >> > > > > >> > >> in connection with Titans and reported it to NVIDIA; now we have to
>> >> > > > > >> > >> wait for the NVIDIA experts' answer. There is also another
>> >> > > > > >> > >> Amber/Titan issue with a different origin (GB of big systems, i.e.
>> >> > > > > >> > >> NUCLEOSOME); you may try that as well. The Amber guys are perhaps
>> >> > > > > >> > >> working on that too.
>> >> > > > > >> > >>
>> >> > > > > >> > >> So in your place I would wait with RMAing unless you have any other
>> >> > > > > >> > >> indications that your GPU might be damaged. In the meantime you may
>> >> > > > > >> > >> run some tests of this GPU with memtestG80.
>> >> > > > > >> > >>
>> >> > > > > >> > >> Here is the most recent version:
>> >> > > > > >> > >>
>> >> > > > > >> > >> ---
>> >> > > > > >> > >> memtestG80
>> >> > > > > >> > >> https://github.com/ihaque/memtestG80
>> >> > > > > >> > >> here is the sync fix code
>> >> > > > > >> > >> https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c
>> >> > > > > >> > >> ---
>> >> > > > > >> > >>
>> >> > > > > >> > >> BTW, which Titan GPU are you using, the stock one or the
>> >> > > > > >> > >> superclocked one?
>> >> > > > > >> > >>
>> >> > > > > >> > >> Anyway, I would recommend recompiling Amber with the latest
>> >> > > > > >> > >> Amber 12 patch (bugfix 18) if you have not done so already.
>> >> > > > > >> > >>
>> >> > > > > >> > >> M.
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >>
>> >> > > > > >> > >> On Thu, 06 Jun 2013 12:01:35 +0200, Jonathan Gough
>> >> > > > > >> > >> <jonathan.d.gough.gmail.com> wrote:
>> >> > > > > >> > >>
>> >> > > > > >> > >> > Bad News.
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > I ran each set of tests 4 times, nstlim=100000. FactorIX was the
>> >> > > > > >> > >> > only one that gave consistent results. Again I had a few that just
>> >> > > > > >> > >> > died without any error messages.
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > CentOS 6
>> >> > > > > >> > >> > GNU compilers
>> >> > > > > >> > >> > CUDA 5.0 and Driver Version: 319.23
>> >> > > > > >> > >> > AmberTools version 13.09
>> >> > > > > >> > >> > Amber version 12.18
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > Cellulose_production_NVE/1/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
>> >> > > > > >> > >> > Cellulose_production_NVE/2/mdout: Died at 4000 steps - no error message.
>> >> > > > > >> > >> > Cellulose_production_NVE/3/mdout: Etot = -443238.0345 EKtot = 257651.0625 EPtot = -700889.0970
>> >> > > > > >> > >> > Cellulose_production_NVE/4/mdout: Etot = -443246.3206 EKtot = 258074.3438 EPtot = -701320.6644
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > Cellulose_production_NPT/1/mdout: Etot = -441009.1612 EKtot = 257571.2031 EPtot = -698580.3643
>> >> > > > > >> > >> > Cellulose_production_NPT/2/mdout: Etot = -440947.3717 EKtot = 257723.3750 EPtot = -698670.7467
>> >> > > > > >> > >> > Cellulose_production_NPT/3/mdout: Etot = -441024.3259 EKtot = 257406.5781 EPtot = -698430.9041
>> >> > > > > >> > >> > Cellulose_production_NPT/4/mdout: Etot = -440970.6005 EKtot = 257756.1250 EPtot = -698726.7255
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > FactorIX_production_NVE/1/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> > >> > FactorIX_production_NVE/2/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> > >> > FactorIX_production_NVE/3/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> > >> > FactorIX_production_NVE/4/mdout: Etot = -234189.5802 EKtot = 54845.8359 EPtot = -289035.4162
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > FactorIX_production_NPT/1/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> > >> > FactorIX_production_NPT/2/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> > >> > FactorIX_production_NPT/3/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> > >> > FactorIX_production_NPT/4/mdout: Etot = -234493.4304 EKtot = 55062.0156 EPtot = -289555.4460
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > JAC_production_NVE/1/mdout: Etot = -58141.0647 EKtot = 14347.6699 EPtot = -72488.7346
>> >> > > > > >> > >> > JAC_production_NVE/2/mdout: Etot = -58141.4961 EKtot = 14320.1465 EPtot = -72461.6425
>> >> > > > > >> > >> > JAC_production_NVE/3/mdout: Died at 48000 steps
>> >> > > > > >> > >> > JAC_production_NVE/4/mdout: Etot = -58141.6938 EKtot = 14257.2305 EPtot = -72398.9243
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > JAC_production_NPT/1/mdout: Died at 78000 steps
>> >> > > > > >> > >> > JAC_production_NPT/2/mdout: Etot = -58206.6103 EKtot = 14384.7959 EPtot = -72591.4062
>> >> > > > > >> > >> > JAC_production_NPT/3/mdout: Etot = -58211.2469 EKtot = 14454.1592 EPtot = -72665.4061
>> >> > > > > >> > >> > JAC_production_NPT/1/mdout: Died at 89000 steps
>> >> > > > > >> > >> >
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > Any recommendations on what to do? Send the card back? Update
>> >> > > > > >> > >> > drivers? Update CUDA?
>> >> > > > > >> > >> >
>> >> > > > > >> > >> >
>> >> > > > > >> > >> >
>> >> > > > > >> > >> >
>> >> > > > > >> > >> > On Wed, Jun 5, 2013 at 6:45 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >> > > > > >> > >> >
>> >> > > > > >> > >> >> Yes, you got it.
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> One more thing: check the benchmark mdin files carefully, and
>> >> > > > > >> > >> >> if you see "ig=-1" there, just delete it to ensure that
>> >> > > > > >> > >> >> both runs of a given test use the same random seed.
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> (As I remember, I found it in just one or two tests; I don't
>> >> > > > > >> > >> >> remember which ones.)
>> >> > > > > >> > >> >>
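The two mdin edits discussed in this thread (setting nstlim=100000 and deleting ig=-1 so that repeated runs share a seed) could be scripted along these lines. This is a minimal Python sketch: the function name prepare_mdin and the sample input are invented for illustration and are not taken from the benchmark suite. It assumes the settings appear as plain name=value entries in the mdin namelist.

```python
import re


def prepare_mdin(text: str, nstlim: int = 100000) -> str:
    """Return mdin text with nstlim set and any 'ig=-1' entry removed."""
    # Replace an existing nstlim assignment, e.g. "nstlim=10000" or "nstlim = 10000".
    text = re.sub(r"\bnstlim\s*=\s*\d+", f"nstlim={nstlim}", text)
    # Drop "ig=-1" together with a trailing comma, staying on the same line.
    text = re.sub(r"\big[ \t]*=[ \t]*-1\b[ \t]*,?[ \t]*", "", text)
    return text


# Illustrative input only; not a file from the benchmark suite.
sample = """Illustrative NPT input
 &cntrl
   ntx=5, irest=1,
   nstlim=10000, dt=0.002,
   ntt=3, gamma_ln=1.0, ig=-1,
   ntb=2, ntp=1,
 /
"""

print(prepare_mdin(sample))
```

Only an explicit ig=-1 is removed; any other seed value is left untouched, since a fixed seed already gives both runs the same starting point.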
>> >> > > > > >> > >> >> Let us know your results, i.e. whether all the tests (JAC NVE/NPT,
>> >> > > > > >> > >> >> FACTOR_IX NVE/NPT etc.) successfully finished all 100K steps (in
>> >> > > > > >> > >> >> both runs) and, moreover, whether the results from both runs
>> >> > > > > >> > >> >> are identical (just check the final energy).
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> In case of any error (written in the mdout file or in standard
>> >> > > > > >> > >> >> output (screen or nohup.out ...)), please report it here as well.
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> Thanks,
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> M.
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> On Thu, 06 Jun 2013 00:34:39 +0200, Jonathan Gough
>> >> > > > > >> > >> >> <jonathan.d.gough.gmail.com> wrote:
>> >> > > > > >> > >> >>
>> >> > > > > >> > >> >> > I know I'm late in the game, but I have been reading some of
>> >> > > > > >> > >> >> > these two Titan threads. I'm now attempting to test my 1 Titan
>> >> > > > > >> > >> >> > card and I want to make sure I understand what I ought to be
>> >> > > > > >> > >> >> > doing.
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> > Download the Amber_GPU_Benchmark_Suite
>> >> > > > > >> > >> >> > in mdin, change nstlim=100000
>> >> > > > > >> > >> >> > and then run the 6 benchmarks at least 2 times each
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> > yes?
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> > The issue that we have had is that simulations would just
>> >> > > > > >> > >> >> > prematurely stop.
>> >> > > > > >> > >> >> > We didn't see any error messages in the mdout file though, they
>> >> > > > > >> > >> >> > just stopped.
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> > We're using CUDA 5.0 and Driver Version: 319.23
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> > On Wed, Jun 5, 2013 at 1:29 PM, Marek Maly <marek.maly.ujep.cz> wrote:
>> >> > > > > >> > >> >> >
>> >> > > > > >> > >> >> >> Hi Scott,
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> thanks for the update! Let's see what the reaction from NVIDIA
>> >> > > > > >> > >> >> >> will be. In the worst case, let's hope that some other
>> >> > > > > >> > >> >> >> (NON-NVIDIA) "GPU FFT library" alternatives exist (to be
>> >> > > > > >> > >> >> >> compiled/used alternatively with pmemd.cuda).
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> BTW, I just found this perhaps interesting article (I only list
>> >> > > > > >> > >> >> >> the supplementary part):
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> http://www.computer.org/csdl/trans/td/preprint/06470608-abs.html
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> OK, meanwhile I finished my experiment/tests with swapping my
>> >> > > > > >> > >> >> >> two Titans between slots. As you can see below, it did not solve
>> >> > > > > >> > >> >> >> the problems on my "less stable" Titan, but on the other hand
>> >> > > > > >> > >> >> >> there is a significant improvement.
>> >> > > > > >> > >> >> >> I will now try with just my "less stable" GPU plugged into the
>> >> > > > > >> > >> >> >> motherboard, to eventually confirm that its lower stability
>> >> > > > > >> > >> >> >> originates in its higher sensitivity to the dual-GPU
>> >> > > > > >> > >> >> >> configuration (OR just to a dual-GPU config with another Titan;
>> >> > > > > >> > >> >> >> maybe with a GTX 580/680 it will be OK, or at least better than
>> >> > > > > >> > >> >> >> with 2 Titans).
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> M.
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> SIMULTANEOUS TEST (BOTH GPUS running at the same time)
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> density (100K steps, NPT, restrained solute)
>> >> > > > > >> > >> >> >> prod1 and prod2 (250K steps, NPT)
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> TITAN_0 and TITAN_1 now identify PCI slots rather than particular cards.
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> All the errors I obtained here were just this one:
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> -----
>> >> > > > > >> > >> >> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >> > > > > >> > >> >> >> -----
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> #1 ORIGINAL CONFIGURATION
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>          density       prod1         prod2
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> TITAN_0
>> >> > > > > >> > >> >> >>          -297755.2479  -299267.1086  65K
>> >> > > > > >> > >> >> >>          20K           -299411.2631  100K
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> TITAN_1
>> >> > > > > >> > >> >> >>          -297906.5447  -298657.3725  -298683.8965
>> >> > > > > >> > >> >> >>          -297906.5447  -298657.3725  -298683.8965
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> #2 AFTER GPU SWAPPING (with respect to PCI slots)
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>          density       prod1         prod2
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> TITAN_0 (so these are results of the GPU previously named TITAN_1)
>> >> > > > > >> > >> >> >>          -297906.5447  -298657.3725  -298683.8965
>> >> > > > > >> > >> >> >>          -297906.5447  -298657.3725  -298683.8965
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> TITAN_1 (so these are results of the GPU previously named TITAN_0)
>> >> > > > > >> > >> >> >>          -297906.5447  240K          -298764.5294
>> >> > > > > >> > >> >> >>          -297752.2836  -298997.8891  -299610.3812
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> On Wed, 05 Jun 2013 18:15:48 +0200, Scott Le Grand
>> >> > > > > >> > >> >> >> <varelse2005.gmail.com> wrote:
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> > Filip,
>> >> > > > > >> > >> >> >> > What's happening on Titan can take a while to trigger. I have
>> >> > > > > >> > >> >> >> > delivered a repro to NVIDIA that shows exactly what's
>> >> > > > > >> > >> >> >> > happening, but it's up to them to explain why because it's
>> >> > > > > >> > >> >> >> > occurring inside cuFFT. That's why you need to run at least
>> >> > > > > >> > >> >> >> > 100K iterations to see a single occurrence.
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > There's a second issue that's happening with large GB
>> >> > > > > >> > >> >> >> > simulations, but that one is even harder to trap. That doesn't
>> >> > > > > >> > >> >> >> > mean it isn't happening, just that it's on the very edge of
>> >> > > > > >> > >> >> >> > doing so on Titan.
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > Thankfully, I have not been able to trigger either bug on
>> >> > > > > >> > >> >> >> > GK104 or K20...
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > _______________________________________________
>> >> > > > > >> > >> >> >> > AMBER mailing list
>> >> > > > > >> > >> >> >> > AMBER.ambermd.org
>> >> > > > > >> > >> >> >> > http://lists.ambermd.org/mailman/listinfo/amber
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > __________ Information from ESET NOD32 Antivirus, database
>> >> > > > > >> > >> >> >> > version 8415 (20130605) __________
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > This message was checked by ESET NOD32 Antivirus.
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> > http://www.eset.cz
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >> >
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >>
>> >> > > > > >> > >> >> >> --
>> >> > > > > >> > >> >> >> This message was created with Opera's revolutionary e-mail client:
>> >> > > > > >> > >> >> >> http://www.opera.com/mail/


-- 
This message was created with Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jun 12 2013 - 12:00:03 PDT