It will be very interesting to see whether this behavior persists after
downclocking. But right now, Titan 0 *looks* hosed and Titan 1 *looks* like
it needs downclocking...
On May 30, 2013 3:20 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
> Hi all,
>
> here are my results from the 500K steps 2 x repeated benchmarks
> under 319.23 driver and still Cuda 5.0 (see the attached table ).
>
> It is hard to say whether the results are better or worse than in my
> previous 100K test under driver 319.17.
>
> Results from the Cellulose test improved: the TITAN_1 card even finished
> all 500K steps successfully, moreover with exactly the same final energy
> in both rounds! (TITAN_0 at least finished more than 100K steps, and in
> RUN_01 even more than 400K steps.) In the JAC_NPT test neither GPU was
> able to finish even 100K steps, and the results from the JAC_NVE test are
> also not very convincing. FACTOR_IX_NVE and FACTOR_IX_NPT finished
> successfully, with 100% reproducibility in the FACTOR_IX_NPT case (on
> both cards) and almost 100% reproducibility in the FACTOR_IX_NVE case
> (again 100% for TITAN_1). TRPCAGE and MYOGLOBIN again finished without
> any problem, with 100% reproducibility. The NUCLEOSOME test was not done
> this time because of its high time requirements. Where the table shows a
> positive number ending in K (meaning "thousands"), it is the last step
> number written in mdout before the crash.
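>
> (In case it helps anyone checking their own runs: a minimal sketch in
> Python for pulling that last step number out of an mdout, assuming the
> step number appears in the usual "NSTEP =" energy records; the file name
> is only an example.)
>
> import re
>
> # mdout of the crashed run (example name)
> text = open("mdout").read()
> steps = re.findall(r"NSTEP\s*=\s*(\d+)", text)
> print(int(steps[-1]) if steps else "no NSTEP record found")
>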
> Below are all three types of detected errors, together with the
> systems/rounds in which each error appeared.
>
> Now I will try just the 100K tests under ET's favourite driver version,
> 313.30 :)), and then I will eventually experiment with CUDA 5.5, which I
> have already downloaded from the CUDA zone (I had to become a CUDA
> developer for this :)) ). BTW, ET, thanks for the frequency info! I am
> still (perhaps not alone :)) ) very curious about your 2x repeated Amber
> benchmark tests with the superclocked Titan, and of course also very
> curious about that "hot" patch from Ross.
>
> M.
>
> ERRORS DETECTED DURING THE 500K-step tests with driver 319.23
>
> #1 ERR written in mdout:
> ------
> | ERROR: max pairlist cutoff must be less than unit cell max sphere
> radius!
> ------
>
> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before
> crash)
> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before
> crash)
>
>
> #2 no ERR written in mdout, ERR written in standard output (nohup.out)
>
> ----
> Error: unspecified launch failure launching kernel kNLSkinTest
> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> ----
>
> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done
> before crash)
> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done before
> crash)
> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done
> before crash)
> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done before
> crash)
> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done before
> crash)
>
>
> #3 no ERR written in mdout, ERR written in standard output (nohup.out)
> ----
> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> ----
>
> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before
> crash)
> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before
> crash)
>
>
> On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand
> <varelse2005.gmail.com> wrote:
>
> Oops meant to send that to Jason...
>>
>> Anyway, before we all panic, we need to get K20's behavior analyzed here.
>> If it's deterministic, this truly is a hardware issue. If not, then it
>> gets interesting because 680 is deterministic as far as I can tell...
>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>>
>> If the errors are not deterministically triggered, they probably won't be
>>> fixed by the patch, alas...
>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>>
>>> Just a reminder to everyone based on what Ross said: there is a pending
>>>> patch to pmemd.cuda that will be coming out shortly (maybe even within
>>>> hours). It's entirely possible that several of these errors are fixed
>>>> by
>>>> this patch.
>>>>
>>>> All the best,
>>>> Jason
>>>>
>>>>
>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>>>> wrote:
>>>>
>>>> > I have observed the same crashes from time to time. I will run
>>>> > cellulose nve for 100k and will post the results here.
>>>> >
>>>> > All the best,
>>>> > Filip
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > ________________________________
>>>> > From: Scott Le Grand <varelse2005.gmail.com>
>>>> > To: AMBER Mailing List <amber.ambermd.org>
>>>> > Sent: Thursday, May 30, 2013 9:01 PM
>>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>> > memtestG80 - UNDERclocking in Linux ?
>>>> >
>>>> >
>>>> > Run cellulose nve for 100k iterations twice. If the final energies
>>>> > don't match, you have a hardware issue. No need to play with ntpr or
>>>> > any other variable.
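>>>> >
>>>> > (A minimal sketch of that comparison in Python, assuming the final
>>>> > energy can be read from the last "Etot   =" record in each mdout;
>>>> > the file names are only examples:)
>>>> >
>>>> > import re
>>>> >
>>>> > def final_etot(mdout_path):
>>>> >     """Return the last total energy (Etot) printed in an mdout file."""
>>>> >     etot = None
>>>> >     with open(mdout_path) as f:
>>>> >         for line in f:
>>>> >             m = re.search(r"Etot\s+=\s+(-?\d+\.\d+)", line)
>>>> >             if m:
>>>> >                 etot = float(m.group(1))
>>>> >     return etot
>>>> >
>>>> > # On healthy, deterministic hardware the two repeats should agree exactly.
>>>> > print(final_etot("cellulose_run1.mdout") == final_etot("cellulose_run2.mdout"))
>>>> >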
>>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>>> >
>>>> > >
>>>> > > Dear all,
>>>> > >
>>>> > > I would also like to share one of my experiences with Titan cards.
>>>> > > We have one GTX Titan card, and with one system (~55k atoms, NVT,
>>>> > > RNA+waters) we ran into the same troubles you are describing. I was
>>>> > > also playing with ntpr to figure out, step by step, what is going on.
>>>> > > I understand that the code uses different routines for calculating
>>>> > > energies+forces or only forces. The simulations of other systems are
>>>> > > perfectly stable, running for days and weeks. Only that particular
>>>> > > system systematically ends up with this error.
>>>> > >
>>>> > > However, there was one interesting issue. When I set ntpr=1, the
>>>> > > error vanished (systematically, in multiple runs) and the simulation
>>>> > > was able to run for millions of steps (I did not let it run for
>>>> > > weeks, as in the meantime I shifted that simulation to another card -
>>>> > > I need data, not testing). All other settings of ntpr failed. As I
>>>> > > read this discussion, I tried to set ene_avg_sampling=1 with some
>>>> > > high value of ntpr (I expected this to shift the code to permanently
>>>> > > use the force+energies part of the code, similarly to ntpr=1), but
>>>> > > the error occurred again.
>>>> > >
>>>> > > I know it is not very conclusive for finding out what is happening,
>>>> > > at least not for me. Do you have any idea why ntpr=1 might help?
>>>> > >
>>>> > > best regards,
>>>> > >
>>>> > > Pavel
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > --
>>>> > > Pavel Banáš
>>>> > > pavel.banas.upol.cz
>>>> > > Department of Physical Chemistry,
>>>> > > Palacky University Olomouc
>>>> > > Czech Republic
>>>> > >
>>>> > >
>>>> > >
>>>> > > ---------- Original message ----------
>>>> > > From: Jason Swails <jason.swails.gmail.com>
>>>> > > Date: 29 May 2013
>>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>> > > memtestG80 - UNDERclocking in Linux ?
>>>> > >
>>>> > > "I'll answer a little bit:
>>>> > >
>>>> > > NTPR=10 Etot after 2000 steps
>>>> > > >
>>>> > > > -443256.6711
>>>> > > > -443256.6711
>>>> > > >
>>>> > > > NTPR=200 Etot after 2000 steps
>>>> > > >
>>>> > > > -443261.0705
>>>> > > > -443261.0705
>>>> > > >
>>>> > > > Any idea why energies should depend on frequency of energy records
>>>> > (NTPR)
>>>> > > ?
>>>> > > >
>>>> > >
>>>> > > It is a subtle point, but the answer is 'different code paths.' In
>>>> > > general, it is NEVER necessary to compute the actual energy of a
>>>> > > molecule during the course of standard molecular dynamics (by
>>>> > > analogy, it is NEVER necessary to compute atomic forces during the
>>>> > > course of random Monte Carlo sampling).
>>>> > >
>>>> > > For performance's sake, then, pmemd.cuda computes only the force
>>>> > > when energies are not requested, leading to a different order of
>>>> > > operations for those runs. This difference ultimately causes
>>>> > > divergence.
>>>> > >
>>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
>>>> > > &cntrl section. This will force pmemd.cuda to compute energies every
>>>> > > 10 steps (for energy averaging), which will in turn make the code
>>>> > > path followed identical for any multiple-of-10 value of ntpr.
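>>>> > >
>>>> > > (For illustration, a fragment of such an &cntrl section; the values
>>>> > > simply mirror the ntpr=200 case quoted above, and everything else in
>>>> > > the input stays as it was:)
>>>> > >
>>>> > >  &cntrl
>>>> > >    ntpr = 200,            ! any multiple of 10
>>>> > >    ene_avg_sampling = 10, ! energies computed every 10 steps
>>>> > >  /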
>>>> > >
>>>> > > --
>>>> > > Jason M. Swails
>>>> > > Quantum Theory Project,
>>>> > > University of Florida
>>>> > > Ph.D. Candidate
>>>> > > 352-392-4032
>>>> "
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jason M. Swails
>>>> Quantum Theory Project,
>>>> University of Florida
>>>> Ph.D. Candidate
>>>> 352-392-4032
>>>>
>>>>
>>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 30 2013 - 16:00:03 PDT