Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz> Date: Fri, 31 May 2013 00:00:21 +0200

----
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
----
TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done before crash)
TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done before crash)
TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done before crash)
TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done before crash)
TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done before crash)
#3 no ERR written in mdout, ERR written in standard output (nohup.out)
----
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
----
TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before crash)
TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before crash)
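
A minimal sketch of the reproducibility check Scott describes in the quoted message below (run the same benchmark twice and compare the final energies). It assumes that each run's mdout reports instantaneous energies on lines containing "Etot = <value>" and that the end-of-run summary begins with a line containing "A V E R A G E S"; the function names are invented for illustration and are not part of AMBER.

```python
import re

# Matches "Etot = -443256.6711" but not "EKtot" or "EPtot".
ETOT_RE = re.compile(r"\bEtot\s*=\s*(-?\d+\.\d+)")


def final_etot(mdout_text):
    """Return the last instantaneous Etot in the text as a string, or None.

    Parsing stops at the averages section so that the averaged and RMS
    values printed at the end of a run are not mistaken for the final step.
    """
    last = None
    for line in mdout_text.splitlines():
        if "A V E R A G E S" in line:
            break
        match = ETOT_RE.search(line)
        if match:
            last = match.group(1)
    return last


def runs_match(mdout_a, mdout_b):
    """True only if both runs report the same final Etot, digit for digit."""
    a = final_etot(mdout_a)
    b = final_etot(mdout_b)
    return a is not None and a == b
```

To use it, read the two mdout files of the repeated runs into strings and pass them to runs_match(). The comparison is on the printed digits, so any difference in the last decimal place counts as a mismatch.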
On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand <varelse2005.gmail.com> wrote:
> Oops meant to send that to Jason...
>
> Anyway, before we all panic, we need to get K20's behavior analyzed here.
> If it's deterministic, this truly is a hardware issue.  If not, then it
> gets interesting because 680 is deterministic as far as I can tell...
> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>
>> If the errors are not deterministically triggered, they probably won't
>> be fixed by the patch, alas...
>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>
>>> Just a reminder to everyone based on what Ross said: there is a pending
>>> patch to pmemd.cuda that will be coming out shortly (maybe even within
>>> hours).  It's entirely possible that several of these errors are fixed
>>> by this patch.
>>>
>>> All the best,
>>> Jason
>>>
>>>
>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>>> wrote:
>>>
>>> > I have observed the same crashes from time to time. I will run
>>> > cellulose nve for 100k steps and will paste the results here.
>>> >
>>> > All the best,
>>> > Filip
>>> >
>>> >
>>> >
>>> >
>>> > ________________________________
>>> >  From: Scott Le Grand <varelse2005.gmail.com>
>>> > To: AMBER Mailing List <amber.ambermd.org>
>>> > Sent: Thursday, May 30, 2013 9:01 PM
>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>> > memtestG80 - UNDERclocking in Linux ?
>>> >
>>> >
>>> > Run cellulose nve for 100k iterations twice. If the final energies
>>> > don't match, you have a hardware issue. No need to play with ntpr or
>>> > any other variable.
>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>> >
>>> > >
>>> > > Dear all,
>>> > >
>>> > > I would also like to share one of my experiences with Titan cards.
>>> > > We have one GTX Titan card, and with one system (~55k atoms, NVT,
>>> > > RNA+waters) we run into the same troubles you are describing. I was
>>> > > also playing with ntpr to figure out what is going on, step by step.
>>> > > I understand that the code uses different routines for calculating
>>> > > energies+forces and for calculating forces only. Simulations of
>>> > > other systems are perfectly stable, running for days and weeks. Only
>>> > > that particular system systematically ends up with this error.
>>> > >
>>> > > However, there was one interesting observation. When I set ntpr=1,
>>> > > the error vanished (systematically, in multiple runs) and the
>>> > > simulation was able to run for millions of steps (I did not let it
>>> > > run for weeks, because in the meantime I moved that simulation to
>>> > > another card; I need data, not testing). All other settings of ntpr
>>> > > failed. After reading this discussion, I tried setting
>>> > > ene_avg_sampling=1 with a high value of ntpr (I expected this to
>>> > > make the code permanently use the force+energies path, similarly to
>>> > > ntpr=1), but the error occurred again.
>>> > >
>>> > > I know this is not very conclusive for finding out what is
>>> > > happening, at least not for me. Do you have any idea why ntpr=1
>>> > > might help?
>>> > >
>>> > > best regards,
>>> > >
>>> > > Pavel
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Pavel Banáš
>>> > > pavel.banas.upol.cz
>>> > > Department of Physical Chemistry,
>>> > > Palacky University Olomouc
>>> > > Czech Republic
>>> > >
>>> > >
>>> > >
>>> > > ---------- Original message ----------
>>> > > From: Jason Swails <jason.swails.gmail.com>
>>> > > Date: 29 May 2013
>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>> > > memtestG80 - UNDERclocking in Linux ?
>>> > >
>>> > > "I'll answer a little bit:
>>> > >
>>> > > > NTPR=10 Etot after 2000 steps
>>> > > >
>>> > > > -443256.6711
>>> > > > -443256.6711
>>> > > >
>>> > > > NTPR=200 Etot after 2000 steps
>>> > > >
>>> > > > -443261.0705
>>> > > > -443261.0705
>>> > > >
>>> > > > Any idea why energies should depend on the frequency of energy
>>> > > > records (NTPR)?
>>> > > >
>>> > >
>>> > > It is a subtle point, but the answer is 'different code paths.' In
>>> > > general, it is NEVER necessary to compute the actual energy of a
>>> > > molecule during the course of standard molecular dynamics (by
>>> > > analogy, it is NEVER necessary to compute atomic forces during the
>>> > > course of random Monte Carlo sampling).
>>> > >
>>> > > For performance's sake, then, pmemd.cuda computes only the force
>>> > > when energies are not requested, leading to a different order of
>>> > > operations for those runs. This difference ultimately causes
>>> > > divergence.
>>> > >
>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
>>> > > &cntrl section. This will force pmemd.cuda to compute energies every
>>> > > 10 steps (for energy averaging), which will in turn make the
>>> > > followed code path identical for any multiple-of-10 value of ntpr.
>>> > >
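
A small illustration of the "different order of operations" point above, independent of pmemd.cuda: floating-point addition is not associative, so summing the same terms in a different order can change the last bits of the result. This is generic Python double-precision arithmetic, not AMBER code; in a chaotic MD trajectory such last-bit differences grow over many steps, which is consistent with the Etot values differing between NTPR settings.

```python
# The same three numbers, added in two different groupings.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # rounds after a + b, then again after + c
right_to_left = a + (b + c)   # rounds after b + c, then again after a +

# The two results differ in the last bit of the double.
differs = left_to_right != right_to_left
gap = abs(left_to_right - right_to_left)

print(differs, gap)
```

The gap is tiny (on the order of 1e-16 here), which is why two runs that follow different code paths agree at first and only diverge visibly after many steps.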
>>> > > --
>>> > > Jason M. Swails
>>> > > Quantum Theory Project,
>>> > > University of Florida
>>> > > Ph.D. Candidate
>>> > > 352-392-4032
>>> > > _______________________________________________
>>> > > AMBER mailing list
>>> > > AMBER.ambermd.org
>>> > > http://lists.ambermd.org/mailman/listinfo/amber"
>>> >
>>>
>>>
>>>
>>> --
>>> Jason M. Swails
>>> Quantum Theory Project,
>>> University of Florida
>>> Ph.D. Candidate
>>> 352-392-4032
>>>
>>
>