Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 30 May 2013 12:24:43 -0700

If the errors are not deterministically triggered, they probably won't be
fixed by the patch, alas...
On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:

> Just a reminder to everyone based on what Ross said: there is a pending
> patch to pmemd.cuda that will be coming out shortly (maybe even within
> hours). It's entirely possible that several of these errors are fixed by
> this patch.
>
> All the best,
> Jason
>
>
> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
> wrote:
>
> > I have observed the same crashes from time to time. I will run cellulose
> > nve for 100k iterations and will paste the results here.
> >
> > All the best,
> > Filip
> >
> >
> >
> >
> > ________________________________
> > From: Scott Le Grand <varelse2005.gmail.com>
> > To: AMBER Mailing List <amber.ambermd.org>
> > Sent: Thursday, May 30, 2013 9:01 PM
> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
> > memtestG80 - UNDERclocking in Linux ?
> >
> >
> > Run cellulose nve for 100k iterations twice. If the final energies don't
> > match, you have a hardware issue. No need to play with ntpr or any other
> > variable.
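
The repeat-run check above can be scripted. What follows is a minimal sketch in Python, not an AMBER tool. It assumes each run wrote a standard mdout text file whose per-step records contain `Etot = <value>` and whose trailing summary starts with an `A V E R A G E S` banner. The function names `final_etot` and `runs_match` are hypothetical. The printed strings are compared exactly, since a bit-for-bit repeat should print identical digits.

```python
import re

# Matches "Etot = -443256.6711" but not "EKtot" or "EPtot".
ETOT_RE = re.compile(r"\bEtot\s*=\s*(-?\d+(?:\.\d+)?)")


def final_etot(mdout_text):
    """Return the last per-step Etot (as printed) before the averages section."""
    last = None
    for line in mdout_text.splitlines():
        # Assumption: the summary block starts with this banner; stop there so
        # that averaged or RMS values are not mistaken for the final step.
        if "A V E R A G E S" in line:
            break
        match = ETOT_RE.search(line)
        if match:
            last = match.group(1)
    if last is None:
        raise ValueError("no Etot record found")
    return last


def runs_match(mdout_a, mdout_b):
    """True if both runs printed exactly the same final total energy."""
    return final_etot(mdout_a) == final_etot(mdout_b)
```

To use it, read the two output files into strings (for example with `pathlib.Path(...).read_text()`) and pass them to `runs_match`. A `False` result on identical inputs and settings corresponds to the mismatch described above.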
> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
> >
> > >
> > > Dear all,
> > >
> > > I would also like to share one of my experiences with Titan cards. We
> > > have one GTX Titan card, and with one system (~55k atoms, NVT,
> > > RNA+waters) we run into the same troubles you are describing. I was
> > > also playing with ntpr to figure out what is going on, step by step. I
> > > understand that the code uses different routines for calculating
> > > energies+forces versus forces only. The simulations of other systems
> > > are perfectly stable, running for days and weeks. Only that particular
> > > system systematically ends up with this error.
> > >
> > > However, there was one interesting issue. When I set ntpr=1, the error
> > > vanished (systematically, in multiple runs) and the simulation was able
> > > to run for millions of steps (I did not let it run for weeks, as in the
> > > meantime I moved that simulation to another card; I need data, not
> > > testing). All other settings of ntpr failed. After reading this
> > > discussion, I tried setting ene_avg_sampling=1 with a high value of
> > > ntpr (I expected this to make the code permanently use the
> > > force+energies path, similarly to ntpr=1), but the error occurred again.
> > >
> > > I know this is not very conclusive for finding out what is happening,
> > > at least not for me. Do you have any idea why ntpr=1 might help?
> > >
> > > best regards,
> > >
> > > Pavel
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Pavel Banáš
> > > pavel.banas.upol.cz
> > > Department of Physical Chemistry,
> > > Palacky University Olomouc
> > > Czech Republic
> > >
> > >
> > >
> > > ---------- Original message ----------
> > > From: Jason Swails <jason.swails.gmail.com>
> > > Date: May 29, 2013
> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
> > > memtestG80 - UNDERclocking in Linux ?
> > >
> > > "I'll answer a little bit:
> > >
> > > > NTPR=10 Etot after 2000 steps
> > > >
> > > > -443256.6711
> > > > -443256.6711
> > > >
> > > > NTPR=200 Etot after 2000 steps
> > > >
> > > > -443261.0705
> > > > -443261.0705
> > > >
> > > > Any idea why energies should depend on frequency of energy records
> > > > (NTPR)?
> > > >
> > >
> > > It is a subtle point, but the answer is 'different code paths.' In
> > > general, it is NEVER necessary to compute the actual energy of a
> > > molecule during the course of standard molecular dynamics (by analogy,
> > > it is NEVER necessary to compute atomic forces during the course of
> > > random Monte Carlo sampling).
> > >
> > > For performance's sake, then, pmemd.cuda computes only the force when
> > > energies are not requested, leading to a different order of operations
> > > for those runs. This difference ultimately causes divergence.
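
Why a different order of operations can change the result at all is easiest to see with ordinary floating-point addition, which is not associative. The sketch below is a generic Python illustration of that rounding effect only. It is not pmemd.cuda code and makes no claim about how pmemd.cuda accumulates forces or energies; the helper name `sum_in_order` is hypothetical.

```python
def sum_in_order(values):
    """Add the values one at a time, left to right, in double precision."""
    total = 0.0
    for value in values:
        total += value
    return total


terms = [0.1, 0.2, 0.3]

# Same three numbers, two summation orders.
forward = sum_in_order(terms)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(terms))  # (0.3 + 0.2) + 0.1

# The two totals differ in the last bit, although the inputs are identical.
print(forward == backward, abs(forward - backward))
```

In a chaotic system such as an MD trajectory, a last-bit difference of this kind grows with every step, which is consistent with the small Etot differences after 2000 steps quoted above.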
> > >
> > > To test this, try setting the variable ene_avg_sampling=10 in the
> > > &cntrl section. This will force pmemd.cuda to compute energies every 10
> > > steps (for energy averaging), which will in turn make the followed code
> > > path identical for any multiple-of-10 value of ntpr.
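
As a sketch of the test proposed above, the Python below renders a `&cntrl` fragment with `ene_avg_sampling=10` and checks whether a given ntpr falls on the same energy schedule. The helper names `render_cntrl` and `shares_energy_schedule` are hypothetical, and `nstlim=2000` is an assumption taken from the "2000 steps" quoted earlier. A real input file needs a title line and the rest of the run settings, which are not shown in this thread and are deliberately left out.

```python
def render_cntrl(settings):
    """Render a dict of &cntrl variables as a Fortran-namelist fragment."""
    lines = [" &cntrl"]
    for name, value in settings.items():
        lines.append(f"   {name}={value},")
    lines.append(" /")
    return "\n".join(lines) + "\n"


def shares_energy_schedule(ntpr, ene_avg_sampling):
    """True if every ntpr-th step is also an energy-averaging step."""
    return ntpr % ene_avg_sampling == 0


# Fragment only: the remaining &cntrl variables of the real run are not shown.
cntrl_text = render_cntrl({"nstlim": 2000, "ntpr": 200, "ene_avg_sampling": 10})
print(cntrl_text)
```

With this setting, ntpr=10 and ntpr=200 are both multiples of 10, so under the explanation above the two runs would be expected to follow the same code path.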
> > >
> > > --
> > > Jason M. Swails
> > > Quantum Theory Project,
> > > University of Florida
> > > Ph.D. Candidate
> > > 352-392-4032
> > > _______________________________________________
> > > AMBER mailing list
> > > AMBER.ambermd.org
> > > http://lists.ambermd.org/mailman/listinfo/amber"
>
>
>
> --
> Jason M. Swails
> Quantum Theory Project,
> University of Florida
> Ph.D. Candidate
> 352-392-4032
>
Received on Thu May 30 2013 - 12:30:03 PDT