Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: ET <sketchfoot.gmail.com>
Date: Fri, 31 May 2013 11:21:21 +0100

P.S. I have another install of Amber on another computer, with a different
Titan and a different driver version: 310.44.

In the interests of thrashing the proverbial horse, I'll run the benchmark
for 50k steps. :P
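
In case it helps anyone repeating these determinism checks, here is a minimal
Python sketch that compares just the energy records of two runs. It assumes
the usual mdout layout where each energy record prints an "Etot   =" line; the
file paths and script name are placeholders:

    #!/usr/bin/env python
    # Compare the Etot series of two pmemd.cuda mdout files (sketch).
    import re
    import sys

    ETOT = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")

    def etot_series(path):
        """Return every Etot value printed in an mdout file, in order."""
        with open(path) as fh:
            return [float(m.group(1)) for m in ETOT.finditer(fh.read())]

    run1, run2 = etot_series(sys.argv[1]), etot_series(sys.argv[2])
    if run1 == run2:
        print("Etot series identical across %d records" % len(run1))
    else:
        print("Etot series differ (%d vs %d records)" % (len(run1), len(run2)))
        for i, (a, b) in enumerate(zip(run1, run2)):
            if a != b:
                print("first mismatch at record %d: %.4f vs %.4f" % (i, a, b))
                break

Usage would be something like: python compare_etot.py mdout.run1 mdout.run2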

br,
g


On 31 May 2013 11:17, ET <sketchfoot.gmail.com> wrote:

> Hi, I just ran the Amber benchmark for the default (10000 steps) on my
> Titan.
>
> Using sdiff -sB showed that the two runs were completely identical. I've
> attached compressed copies of the mdout and diff files.
>
> br,
> g
>
>
> On 30 May 2013 23:41, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> OK, let's see. I regard downclocking as the very last resort (if I don't
>> decide to RMA the cards). But for now there are still other experiments to
>> try :))
>> I just started the 100K tests under the 313.30 driver. Good night for today ...
>>
>> M.
>>
>> On Fri, 31 May 2013 00:45:49 +0200, Scott Le Grand <varelse2005.gmail.com>
>> wrote:
>>
>> > It will be very interesting if this behavior persists after
>> > downclocking.
>> >
>> > But right now, Titan 0 *looks* hosed and Titan 1 *looks* like it needs
>> > downclocking...
>> > On May 30, 2013 3:20 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>> >
>> >> Hi all,
>> >>
>> >> here are my results from the 2x repeated 500K-step benchmarks under the
>> >> 319.23 driver, still with CUDA 5.0 (see the attached table).
>> >>
>> >> It is hard to say if the results are better or worse than in my
>> >> previous 100K test under driver 319.17.
>> >>
>> >> Results from the Cellulose test improved: the TITAN_1 card even finished
>> >> all 500K steps successfully, moreover with exactly the same final energy!
>> >> (TITAN_0 at least finished more than 100K steps, and in RUN_01 even more
>> >> than 400K steps.)
>> >> In the JAC_NPT test neither GPU was able to finish even 100K steps, and
>> >> the results from the JAC_NVE test are also not very convincing.
>> >> FACTOR_IX_NVE and FACTOR_IX_NPT finished successfully, with 100%
>> >> reproducibility in the FACTOR_IX_NPT case (on both cards) and almost 100%
>> >> reproducibility in the FACTOR_IX_NVE case (again 100% on TITAN_1).
>> >> TRPCAGE and MYOGLOBIN again finished without any problem, with 100%
>> >> reproducibility. The NUCLEOSOME test was not done this time due to its
>> >> high time requirements. A positive number in the table ending with K
>> >> (which means "thousands") is the last step number written to mdout before
>> >> the crash.
>> >> Below are all three types of detected errors, with the systems/rounds
>> >> where each error appeared.
>> >>
>> >> Now I will try just the 100K tests under ET's favourite driver version,
>> >> 313.30 :)), and then I may experiment with CUDA 5.5, which I have already
>> >> downloaded from the CUDA zone (I had to register as a CUDA developer for
>> >> this :)) ). BTW ET, thanks for the frequency info! I am still (perhaps
>> >> not alone :)) ) very curious about your 2x repeated Amber benchmark tests
>> >> with the superclocked Titan, and of course also about that "hot" patch
>> >> from Ross.
>> >>
>> >> M.
>> >>
>> >> ERRORS DETECTED DURING THE 500K steps tests with driver 319.23
>> >>
>> >> #1 ERR written in mdout:
>> >> ------
>> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere
>> >> radius!
>> >> ------
>> >>
>> >> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before
>> >> crash)
>> >> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before
>> >> crash)
>> >>
>> >>
>> >> #2 no ERR written in mdout, ERR written in standard output (nohup.out)
>> >>
>> >> ----
>> >> Error: unspecified launch failure launching kernel kNLSkinTest
>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> >> ----
>> >>
>> >> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done
>> >> before crash)
>> >> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done
>> >> before
>> >> crash)
>> >> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done
>> >> before crash)
>> >> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done
>> >> before
>> >> crash)
>> >> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done
>> before
>> >> crash)
>> >>
>> >>
>> >> #3 no ERR written in mdout, ERR written in standard output (nohup.out)
>> >> ----
>> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> >> ----
>> >>
>> >> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done
>> before
>> >> crash)
>> >> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done
>> before
>> >> crash)
>> >>
>> >> On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand <varelse2005.gmail.com>
>> >> wrote:
>> >>
>> >>> Oops, meant to send that to Jason...
>> >>>
>> >>> Anyway, before we all panic, we need to get K20's behavior analyzed
>> >>> here.
>> >>> If it's deterministic, this truly is a hardware issue. If not, then it
>> >>> gets interesting because 680 is deterministic as far as I can tell...
>> >>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com>
>> >>> wrote:
>> >>>
>> >>>> If the errors are not deterministically triggered, they probably won't
>> >>>> be fixed by the patch, alas...
>> >>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>> Just a reminder to everyone based on what Ross said: there is a
>> >>>> pending
>> >>>>> patch to pmemd.cuda that will be coming out shortly (maybe even
>> >>>>> within
>> >>>>> hours). It's entirely possible that several of these errors are
>> >>>>> fixed
>> >>>>> by
>> >>>>> this patch.
>> >>>>>
>> >>>>> All the best,
>> >>>>> Jason
>> >>>>>
>> >>>>>
>> >>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>> > I have observed the same crashes from time to time. I will run
>> >>>>> > cellulose nve for 100k and will paste the results here.
>> >>>>> >
>> >>>>> > All the best,
>> >>>>> > Filip
>> >>>>> >
>> >>>>> >
>> >>>>> >
>> >>>>> >
>> >>>>> > ________________________________
>> >>>>> > From: Scott Le Grand <varelse2005.gmail.com>
>> >>>>> > To: AMBER Mailing List <amber.ambermd.org>
>> >>>>> > Sent: Thursday, May 30, 2013 9:01 PM
>> >>>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> >>>>> > memtestG80 - UNDERclocking in Linux ?
>> >>>>> >
>> >>>>> >
>> >>>>> > Run cellulose nve for 100k iterations twice. If the final energies
>> >>>>> > don't match, you have a hardware issue. No need to play with ntpr or
>> >>>>> > any other variable.
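>> >>>>> >
>> >>>>> > (Purely for illustration - this is not the benchmark's actual input
>> >>>>> > file, and every value below is a placeholder assumption - a minimal
>> >>>>> > NVE &cntrl section for such a run-twice-and-compare test could look
>> >>>>> > like:
>> >>>>> >
>> >>>>> >   &cntrl
>> >>>>> >     imin = 0, irest = 1, ntx = 5,  ! MD restart from equilibrated coords
>> >>>>> >     nstlim = 100000, dt = 0.002,   ! 100k steps of 2 fs
>> >>>>> >     ntc = 2, ntf = 2, cut = 8.0,   ! SHAKE on H bonds, 8 Angstrom cutoff
>> >>>>> >     ntb = 1, ntt = 0,              ! constant volume, no thermostat (NVE)
>> >>>>> >     ntpr = 1000,                   ! print energies every 1000 steps
>> >>>>> >   /
>> >>>>> >
>> >>>>> > Run it twice with identical inputs and compare the final Etot values.)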
>> >>>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>> >>>>> >
>> >>>>> > >
>> >>>>> > > Dear all,
>> >>>>> > >
>> >>>>> > > I would also like to share one of my experiences with Titan cards.
>> >>>>> > > We have one GTX Titan card, and with one system (~55k atoms, NVT,
>> >>>>> > > RNA+waters) we ran into the same troubles you are describing. I was
>> >>>>> > > also playing with ntpr to figure out what is going on, step by step.
>> >>>>> > > I understand that the code uses different routines for calculating
>> >>>>> > > energies+forces or forces only. The simulations of other systems are
>> >>>>> > > perfectly stable, running for days and weeks. Only that particular
>> >>>>> > > system systematically ends up with this error.
>> >>>>> > >
>> >>>>> > > However, there was one interesting issue. When I set ntpr=1, the
>> >>>>> > > error vanished (systematically, in multiple runs) and the simulation
>> >>>>> > > was able to run for more than a million steps (I did not let it run
>> >>>>> > > for weeks, as in the meantime I shifted that simulation to another
>> >>>>> > > card - I need data, not testing). All other settings of ntpr failed.
>> >>>>> > > After reading this discussion, I tried setting ene_avg_sampling=1
>> >>>>> > > with some high value of ntpr (I expected this would shift the code to
>> >>>>> > > permanently use the force+energies part of the code, similarly to
>> >>>>> > > ntpr=1), but the error occurred again.
>> >>>>> > >
>> >>>>> > > I know it is not very conclusive for finding out what is happening,
>> >>>>> > > at least not for me. Do you have any idea why ntpr=1 might help?
>> >>>>> > >
>> >>>>> > > best regards,
>> >>>>> > >
>> >>>>> > > Pavel
>> >>>>> > >
>> >>>>> > >
>> >>>>> > >
>> >>>>> > >
>> >>>>> > >
>> >>>>> > > --
>> >>>>> > > Pavel Banáš
>> >>>>> > > pavel.banas.upol.cz
>> >>>>> > > Department of Physical Chemistry,
>> >>>>> > > Palacky University Olomouc
>> >>>>> > > Czech Republic
>> >>>>> > >
>> >>>>> > >
>> >>>>> > >
>> >>>>> > > ---------- Original message ----------
>> >>>>> > > From: Jason Swails <jason.swails.gmail.com>
>> >>>>> > > Date: 29 May 2013
>> >>>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>> >>>>> > > memtestG80 - UNDERclocking in Linux ?
>> >>>>> > >
>> >>>>> > > "I'll answer a little bit:
>> >>>>> > >
>> >>>>> > > NTPR=10 Etot after 2000 steps
>> >>>>> > > >
>> >>>>> > > > -443256.6711
>> >>>>> > > > -443256.6711
>> >>>>> > > >
>> >>>>> > > > NTPR=200 Etot after 2000 steps
>> >>>>> > > >
>> >>>>> > > > -443261.0705
>> >>>>> > > > -443261.0705
>> >>>>> > > >
>> >>>>> > > > Any idea why energies should depend on frequency of energy
>> >>>>> records
>> >>>>> > (NTPR)
>> >>>>> > > ?
>> >>>>> > > >
>> >>>>> > >
>> >>>>> > > It is a subtle point, but the answer is 'different code paths.' In
>> >>>>> > > general, it is NEVER necessary to compute the actual energy of a
>> >>>>> > > molecule during the course of standard molecular dynamics (by
>> >>>>> > > analogy, it is NEVER necessary to compute atomic forces during the
>> >>>>> > > course of random Monte Carlo sampling).
>> >>>>> > >
>> >>>>> > > For performance's sake, then, pmemd.cuda computes only the force when
>> >>>>> > > energies are not requested, leading to a different order of
>> >>>>> > > operations for those runs. This difference ultimately causes
>> >>>>> > > divergence.
>> >>>>> > >
>> >>>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
>> >>>>> > > &cntrl section. This will force pmemd.cuda to compute energies every
>> >>>>> > > 10 steps (for energy averaging), which will in turn make the code
>> >>>>> > > path followed identical for any multiple-of-10 value of ntpr.
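>> >>>>> > >
>> >>>>> > > (For illustration, the relevant &cntrl lines for that test would be
>> >>>>> > > something like the sketch below; only ntpr and ene_avg_sampling are
>> >>>>> > > taken from this discussion, the remaining values are placeholder
>> >>>>> > > assumptions:
>> >>>>> > >
>> >>>>> > >   &cntrl
>> >>>>> > >     nstlim = 2000, dt = 0.002,  ! short test run (placeholder values)
>> >>>>> > >     ntpr = 200,                 ! request energy output every 200 steps
>> >>>>> > >     ene_avg_sampling = 10,      ! but evaluate energies every 10 steps
>> >>>>> > >   /
>> >>>>> > >
>> >>>>> > > With this, any ntpr that is a multiple of 10 should follow the same
>> >>>>> > > code path.)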
>> >>>>> > >
>> >>>>> > > --
>> >>>>> > > Jason M. Swails
>> >>>>> > > Quantum Theory Project,
>> >>>>> > > University of Florida
>> >>>>> > > Ph.D. Candidate
>> >>>>> > > 352-392-4032
>> >>>>> "
>> >>>>> > >
>> >>>>> >
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Jason M. Swails
>> >>>>> Quantum Theory Project,
>> >>>>> University of Florida
>> >>>>> Ph.D. Candidate
>> >>>>> 352-392-4032
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >> --
>> >> This message was created with Opera's revolutionary e-mail client:
>> >> http://www.opera.com/mail/
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>>
>> --
>> This message was created with Opera's revolutionary e-mail client:
>> http://www.opera.com/mail/
>>
>>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri May 31 2013 - 03:30:05 PDT