Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 30 May 2013 12:27:17 -0700

Oops meant to send that to Jason...

Anyway, before we all panic, we need to get K20's behavior analyzed here.
If it's deterministic, this truly is a hardware issue. If not, then it
gets interesting because 680 is deterministic as far as I can tell...
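
A rough sketch of the determinism check proposed further down this thread (run the same benchmark twice, compare final energies). The mdout layout assumed here (an "Etot   =" field, with run averages after a header containing "A V E R A G E S") and the function names are assumptions for illustration, not part of AMBER:

```python
import re

# Matches "Etot   =   -443256.6711" but not EKtot/EPtot.
ETOT = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")


def final_etot(mdout_text):
    """Return the last Etot printed before any averages section, as a string."""
    # Assumption: averages and RMS fluctuations, if present, come after
    # a header containing "A V E R A G E S"; ignore everything from there on.
    body = mdout_text.split("A V E R A G E S")[0]
    matches = ETOT.findall(body)
    if not matches:
        raise ValueError("no Etot record found")
    return matches[-1]


def runs_match(text_a, text_b):
    """True if both runs printed exactly the same final Etot."""
    # Compare the printed digits rather than floats: two deterministic
    # runs of the same input should agree to every printed digit.
    return final_etot(text_a) == final_etot(text_b)
```

To use it, read the two mdout files into strings and pass them to runs_match(); a False result for two identical inputs on the same card would point at the hardware.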
On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:

> If the errors are not deterministically triggered, they probably won't be
> fixed by the patch, alas...
> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:
>
>> Just a reminder to everyone based on what Ross said: there is a pending
>> patch to pmemd.cuda that will be coming out shortly (maybe even within
>> hours). It's entirely possible that several of these errors are fixed by
>> this patch.
>>
>> All the best,
>> Jason
>>
>>
>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>> wrote:
>>
>> > I have observed the same crashes from time to time. I will run cellulose
>> > nve for 100k and will paste the results here.
>> >
>> > All the best,
>> > Filip
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: Scott Le Grand <varelse2005.gmail.com>
>> > To: AMBER Mailing List <amber.ambermd.org>
>> > Sent: Thursday, May 30, 2013 9:01 PM
>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> > memtestG80 - UNDERclocking in Linux ?
>> >
>> >
>> > Run cellulose nve for 100k iterations twice. If the final energies don't
>> > match, you have a hardware issue. No need to play with ntpr or any other
>> > variable.
>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>> >
>> > >
>> > > Dear all,
>> > >
>> > > I would also like to share one of my experiences with Titan cards. We have
>> > > one GTX Titan card, and with one system (~55k atoms, NVT, RNA + waters) we
>> > > run into the same troubles you are describing. I was also playing with
>> > > ntpr to figure out what is going on, step by step. I understand that the
>> > > code uses different routines for calculating energies + forces and for
>> > > calculating forces only. Simulations of other systems are perfectly
>> > > stable, running for days and weeks. Only that particular system
>> > > systematically ends up with this error.
>> > >
>> > > However, there was one interesting issue. When I set ntpr=1, the error
>> > > vanished (systematically, in multiple runs) and the simulation was able
>> > > to run for millions of steps (I did not let it run for weeks because in
>> > > the meantime I moved that simulation to another card; I need data, not
>> > > testing). All other settings of ntpr failed. After reading this
>> > > discussion, I tried setting ene_avg_sampling=1 with a high value of ntpr
>> > > (I expected this to make the code always use the forces + energies path,
>> > > as with ntpr=1), but the error occurred again.
>> > >
>> > > I know this is not very conclusive for finding out what is happening, at
>> > > least not for me. Do you have any idea why ntpr=1 might help?
>> > >
>> > > best regards,
>> > >
>> > > Pavel
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Pavel Banáš
>> > > pavel.banas.upol.cz
>> > > Department of Physical Chemistry,
>> > > Palacky University Olomouc
>> > > Czech Republic
>> > >
>> > >
>> > >
>> > > ---------- Original message ----------
>> > > From: Jason Swails <jason.swails.gmail.com>
>> > > Date: May 29, 2013
>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> > > memtestG80 - UNDERclocking in Linux ?
>> > >
>> > > "I'll answer a little bit:
>> > >
>> > > > NTPR=10 Etot after 2000 steps
>> > > >
>> > > > -443256.6711
>> > > > -443256.6711
>> > > >
>> > > > NTPR=200 Etot after 2000 steps
>> > > >
>> > > > -443261.0705
>> > > > -443261.0705
>> > > >
>> > > > Any idea why energies should depend on frequency of energy records
>> > > > (NTPR)?
>> > > >
>> > >
>> > > It is a subtle point, but the answer is 'different code paths.' In
>> > > general, it is NEVER necessary to compute the actual energy of a molecule
>> > > during the course of standard molecular dynamics (by analogy, it is NEVER
>> > > necessary to compute atomic forces during the course of random Monte
>> > > Carlo sampling).
>> > >
>> > > For performance's sake, then, pmemd.cuda computes only the force when
>> > > energies are not requested, leading to a different order of operations
>> > > for those runs. This difference ultimately causes divergence.
>> > >
>> > > To test this, try setting the variable ene_avg_sampling=10 in the &cntrl
>> > > section. This will force pmemd.cuda to compute energies every 10 steps
>> > > (for energy averaging), which will in turn make the followed code path
>> > > identical for any multiple-of-10 value of ntpr.
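
A small illustration of why a different order of operations can change the result at all: floating-point addition is not associative, so summing the same terms in a different order may round differently. The three values below are made up for the example and are not AMBER data; in a chaotic system such as an MD trajectory, a last-digit difference like this one can grow over many steps.

```python
# Three made-up contributions to one sum (illustrative only).
terms = [0.1, 0.2, 0.3]

# Same terms, two different groupings, as two code paths might produce.
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

# Repeating the SAME grouping gives the same bits every time.
same_order_again = (terms[0] + terms[1]) + terms[2]

print(left_to_right == right_to_left)     # False: the groupings round differently
print(left_to_right == same_order_again)  # True: a fixed order is reproducible
```

This is why results that differ between ntpr settings are not by themselves alarming, while results that differ between two identical runs are.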
>> > >
>> > > --
>> > > Jason M. Swails
>> > > Quantum Theory Project,
>> > > University of Florida
>> > > Ph.D. Candidate
>> > > 352-392-4032
>> > > _______________________________________________
>> > > AMBER mailing list
>> > > AMBER.ambermd.org
>> > > http://lists.ambermd.org/mailman/listinfo/amber"
>> >
>>
>>
>>
>> --
>> Jason M. Swails
>> Quantum Theory Project,
>> University of Florida
>> Ph.D. Candidate
>> 352-392-4032
>>
>
Received on Thu May 30 2013 - 12:30:04 PDT