Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: ET <sketchfoot.gmail.com>
Date: Fri, 31 May 2013 11:17:07 +0100

Hi, I just ran the Amber benchmark at the default length (10,000 steps) on my
Titan.

Comparing the two runs with sdiff -sB showed that they were completely
identical. I've attached compressed copies of the mdout and diff files.
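
For reference, a minimal sketch of that check, assuming the two runs wrote
mdout.run1 and mdout.run2 (the file names here are hypothetical):

  # -s suppresses common lines, -B ignores blank lines; empty output
  # means the two mdout files are line-for-line identical.
  sdiff -sB mdout.run1 mdout.run2

  # Or compare just the last energy record of each run:
  grep "Etot" mdout.run1 | tail -n 1
  grep "Etot" mdout.run2 | tail -n 1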

br,
g


On 30 May 2013 23:41, Marek Maly <marek.maly.ujep.cz> wrote:

> OK, let's see. I regard downclocking as the very last resort (unless I
> decide to RMA the cards), but for now there are still other experiments to
> try :))
> I have just started the 100K tests under the 313.30 driver. Good night for
> today ...
>
> M.
>
> On Fri, 31 May 2013 00:45:49 +0200, Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
> > It will be very interesting if this behavior persists after downclocking.
> >
> > But right now, Titan 0 *looks* hosed and Titan 1 *looks* like it needs
> > downclocking...
> > On May 30, 2013 3:20 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
> >
> >> Hi all,
> >>
> >> here are my results from the twice-repeated 500K-step benchmarks
> >> under the 319.23 driver, still with CUDA 5.0 (see the attached table).
> >>
> >> It is hard to say if the results are better or worse than in my
> >> previous 100K test under driver 319.17.
> >>
> >> The Cellulose results improved: TITAN_1 even finished all 500K steps
> >> successfully, and with exactly the same final energy in both rounds!
> >> (TITAN_0 at least finished more than 100K steps, and in RUN_01 more than
> >> 400K steps.) In the JAC_NPT test neither GPU managed to finish even 100K
> >> steps, and the JAC_NVE results are also not very convincing.
> >> FACTOR_IX_NVE and FACTOR_IX_NPT finished successfully, with 100%
> >> reproducibility in the FACTOR_IX_NPT case (on both cards) and almost
> >> 100% reproducibility for FACTOR_IX_NVE (again 100% on TITAN_1). TRPCAGE
> >> and MYOGLOBIN again finished without any problem and with 100%
> >> reproducibility. The NUCLEOSOME test was skipped this time due to its
> >> high time requirements. A positive number ending in K ("thousands") in
> >> the table is the last step number written to mdout before the crash.
> >> Below are all three types of detected errors, with the systems/rounds in
> >> which each error appeared.
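
A hedged one-liner for recovering that last step number from a given mdout
(field positions assumed from the standard mdout energy-record layout):

  # Print the step count of the last energy record written before the crash.
  grep "NSTEP =" mdout | tail -n 1 | awk '{print $3}'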
> >>
> >> Now I will try just the 100K tests under ET's favourite driver version
> >> 313.30 :)) and then I will eventually experiment with CUDA 5.5, which I
> >> have already downloaded from the CUDA zone (I had to become a CUDA
> >> developer for this :)) ). BTW ET, thanks for the frequency info! And I am
> >> still (perhaps not alone :)) ) very curious about your twice-repeated
> >> Amber benchmark tests with the superclocked Titan, and of course also
> >> about that "hot" patch from Ross.
> >>
> >> M.
> >>
> >> ERRORS DETECTED DURING THE 500K-step tests with driver 319.23
> >>
> >> #1 ERR written in mdout:
> >> ------
> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
> >> ------
> >>
> >> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before
> >> crash)
> >> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before
> >> crash)
> >>
> >>
> >> #2 no ERR written in mdout, ERR written in standard output (nohup.out)
> >>
> >> ----
> >> Error: unspecified launch failure launching kernel kNLSkinTest
> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
> >> ----
> >>
> >> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done
> >> before crash)
> >> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done before
> >> crash)
> >> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done
> >> before crash)
> >> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done before
> >> crash)
> >> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done before
> >> crash)
> >>
> >>
> >> #3 no ERR written in mdout, ERR written in standard output (nohup.out)
> >> ----
> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> >> ----
> >>
> >> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before
> >> crash)
> >> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before
> >> crash)
> >>
> >> On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand <varelse2005.gmail.com>
> >> wrote:
> >>
> >>> Oops, meant to send that to Jason...
> >>>
> >>> Anyway, before we all panic, we need to get K20's behavior analyzed
> >>> here.
> >>> If it's deterministic, this truly is a hardware issue. If not, then it
> >>> gets interesting because 680 is deterministic as far as I can tell...
> >>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com>
> >>> wrote:
> >>>
> >>>> If the errors are not deterministically triggered, they probably
> >>>> won't be fixed by the patch, alas...
> >>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Just a reminder to everyone based on what Ross said: there is a
> >>>>> pending patch to pmemd.cuda that will be coming out shortly (maybe
> >>>>> even within hours). It's entirely possible that several of these
> >>>>> errors are fixed by this patch.
> >>>>>
> >>>>> All the best,
> >>>>> Jason
> >>>>>
> >>>>>
> >>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev
> >>>>> <filipfratev.yahoo.com> wrote:
> >>>>>
> >>>>> > I have observed the same crashes from time to time. I will run
> >>>>> > cellulose NVE for 100K and will paste the results here.
> >>>>> >
> >>>>> > All the best,
> >>>>> > Filip
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > ________________________________
> >>>>> > From: Scott Le Grand <varelse2005.gmail.com>
> >>>>> > To: AMBER Mailing List <amber.ambermd.org>
> >>>>> > Sent: Thursday, May 30, 2013 9:01 PM
> >>>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
> >>>>> > memtestG80 - UNDERclocking in Linux ?
> >>>>> >
> >>>>> >
> >>>>> > Run cellulose NVE for 100K iterations twice. If the final energies
> >>>>> > don't match, you have a hardware issue. No need to play with ntpr
> >>>>> > or any other variable.
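
A minimal sketch of that two-run test; the pmemd.cuda flags are the standard
ones, but the input/output file names here are placeholders:

  # Run the same inputs twice on the GPU under test.
  for i in 1 2; do
    $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
        -o mdout.run$i -r restrt.run$i -x mdcrd.run$i
  done

  # On healthy hardware the final energy records match exactly.
  grep "Etot" mdout.run1 | tail -n 1
  grep "Etot" mdout.run2 | tail -n 1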
> >>>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
> >>>>> >
> >>>>> > >
> >>>>> > > Dear all,
> >>>>> > >
> >>>>> > > I would also like to share one of my experiences with Titan
> >>>>> > > cards. We have one GTX Titan card, and with one system (~55k
> >>>>> > > atoms, NVT, RNA + waters) we ran into the same troubles you are
> >>>>> > > describing. I was also playing with ntpr to figure out, step by
> >>>>> > > step, what is going on. I understand that the code uses different
> >>>>> > > routines for calculating energies+forces versus forces only. The
> >>>>> > > simulations of other systems are perfectly stable, running for
> >>>>> > > days and weeks. Only that particular system systematically ends up
> >>>>> > > with this error.
> >>>>> > >
> >>>>> > > However, there was one interesting issue. When I set ntpr=1, the
> >>>>> > > error vanished (systematically, in multiple runs) and the
> >>>>> > > simulation was able to run for more than a million steps (I did
> >>>>> > > not let it run for weeks, as in the meantime I shifted that
> >>>>> > > simulation to another card - I need data, not testing). All other
> >>>>> > > settings of ntpr failed. As I read this discussion, I tried
> >>>>> > > setting ene_avg_sampling=1 with some high value of ntpr (I
> >>>>> > > expected this would shift the code to permanently use the
> >>>>> > > force+energies part of the code, similarly to ntpr=1), but the
> >>>>> > > error occurred again.
> >>>>> > >
> >>>>> > > I know it is not very conclusive for finding out what is
> >>>>> > > happening, at least not for me. Do you have any idea why ntpr=1
> >>>>> > > might help?
> >>>>> > >
> >>>>> > > best regards,
> >>>>> > >
> >>>>> > > Pavel
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > > --
> >>>>> > > Pavel Banáš
> >>>>> > > pavel.banas.upol.cz
> >>>>> > > Department of Physical Chemistry,
> >>>>> > > Palacky University Olomouc
> >>>>> > > Czech Republic
> >>>>> > >
> >>>>> > >
> >>>>> > >
> >>>>> > > ---------- Original message ----------
> >>>>> > > From: Jason Swails <jason.swails.gmail.com>
> >>>>> > > Date: 29 May 2013
> >>>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
> >>>>> > > memtestG80 - UNDERclocking in Linux ?
> >>>>> > >
> >>>>> > > "I'll answer a little bit:
> >>>>> > >
> >>>>> > > > NTPR=10 Etot after 2000 steps
> >>>>> > > >
> >>>>> > > > -443256.6711
> >>>>> > > > -443256.6711
> >>>>> > > >
> >>>>> > > > NTPR=200 Etot after 2000 steps
> >>>>> > > >
> >>>>> > > > -443261.0705
> >>>>> > > > -443261.0705
> >>>>> > > >
> >>>>> > > > Any idea why energies should depend on the frequency of energy
> >>>>> > > > records (NTPR)?
> >>>>> > > >
> >>>>> > >
> >>>>> > > It is a subtle point, but the answer is 'different code paths.'
> >>>>> > > In general, it is NEVER necessary to compute the actual energy of
> >>>>> > > a molecule during the course of standard molecular dynamics (by
> >>>>> > > analogy, it is NEVER necessary to compute atomic forces during the
> >>>>> > > course of random Monte Carlo sampling).
> >>>>> > >
> >>>>> > > For performance's sake, then, pmemd.cuda computes only the force
> >>>>> > > when energies are not requested, leading to a different order of
> >>>>> > > operations for those runs. This difference ultimately causes
> >>>>> > > divergence.
> >>>>> > >
> >>>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
> >>>>> > > &cntrl section. This will force pmemd.cuda to compute energies
> >>>>> > > every 10 steps (for energy averaging), which will in turn make the
> >>>>> > > code path identical for any multiple-of-10 value of ntpr.
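
A sketch of an mdin exercising that suggestion; ntpr and ene_avg_sampling are
the variables under discussion, and every other setting is a generic
placeholder, not taken from this thread:

  ene_avg_sampling test (placeholder settings)
   &cntrl
     imin = 0, nstlim = 2000, dt = 0.002,
     ntb = 1, cut = 8.0,
     ntpr = 200,
     ene_avg_sampling = 10,
   /

With this input, energies are written every 200 steps but computed every 10,
so any multiple-of-10 value of ntpr should follow the same code path.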
> >>>>> > >
> >>>>> > > --
> >>>>> > > Jason M. Swails
> >>>>> > > Quantum Theory Project,
> >>>>> > > University of Florida
> >>>>> > > Ph.D. Candidate
> >>>>> > > 352-392-4032
> >>>>> "
> >>>>> > >
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Jason M. Swails
> >>>>> Quantum Theory Project,
> >>>>> University of Florida
> >>>>> Ph.D. Candidate
> >>>>> 352-392-4032
> >>>>>
> >>>>>
> >>>
> >>>
> >>
> >> --
> >> This message was created with Opera's revolutionary e-mail client:
> >> http://www.opera.com/mail/
> >>
> >>
> >
>
>
> --
> This message was created with Opera's revolutionary e-mail client:
> http://www.opera.com/mail/
>
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Fri May 31 2013 - 03:30:04 PDT