Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Fri, 31 May 2013 00:41:54 +0200

OK, let's see. I regard downclocking as the very last resort (unless I
decide to RMA the cards). For now there are still other experiments to
try :))
I have just started the 100K tests under the 313.30 driver. Good night
for today ...

    M.

On Fri, 31 May 2013 00:45:49 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:

> It will be very interesting if this behavior persists after downclocking.
>
> But right now, Titan 0 *looks* hosed and Titan 1 *looks* like it needs
> downclocking...
> On May 30, 2013 3:20 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>
>> Hi all,
>>
>> here are my results from the 2 x repeated 500K-step benchmarks
>> under the 319.23 driver, still with CUDA 5.0 (see the attached table).
>>
>> It is hard to say whether the results are better or worse than in my
>> previous 100K test under driver 319.17.
>>
>> The CELLULOSE results improved: the TITAN_1 card even finished all
>> 500K steps successfully, moreover with exactly the same final energy
>> in both rounds! (TITAN_0 at least finished more than 100K steps, and
>> in RUN_01 even more than 400K steps.) In the JAC_NPT test neither GPU
>> managed to finish even 100K steps, and the JAC_NVE results are not
>> very convincing either. FACTOR_IX_NVE and FACTOR_IX_NPT finished
>> successfully, with 100% reproducibility for FACTOR_IX_NPT (on both
>> cards) and almost 100% reproducibility for FACTOR_IX_NVE (again 100%
>> on TITAN_1). TRPCAGE and MYOGLOBIN again finished without any problem,
>> with 100% reproducibility. The NUCLEOSOME test was skipped this time
>> because of its high time requirements. Where the table shows a
>> positive number ending in K (meaning "thousands"), it is the last
>> step number written to mdout before the crash.
>>
>> Below are the three types of detected errors, with the systems/rounds
>> in which each error appeared.
>>
>> Now I will try just the 100K tests under ET's favourite driver
>> version 313.30 :)) and then I will eventually experiment with CUDA
>> 5.5, which I have already downloaded from the CUDA zone (I had to
>> register as a CUDA developer for this :)) ). BTW, ET, thanks for the
>> frequency info! I am still (perhaps not alone :)) ) very curious
>> about your 2 x repeated Amber benchmark tests with the superclocked
>> Titan, and of course also about that "hot" patch from Ross.
>>
>> M.
>>
>> ERRORS DETECTED DURING THE 500K steps tests with driver 319.23
>>
>> #1 ERR written in mdout:
>> ------
>> | ERROR: max pairlist cutoff must be less than unit cell max sphere
>> radius!
>> ------
>>
>> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before
>> crash)
>> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before
>> crash)
>>
>>
>> #2 no ERR written in mdout, ERR written in standard output (nohup.out)
>>
>> ----
>> Error: unspecified launch failure launching kernel kNLSkinTest
>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> ----
>>
>> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully
>> done before crash)
>> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done
>> before crash)
>> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully
>> done before crash)
>> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done
>> before crash)
>> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done
>> before crash)
>>
>>
>> #3 no ERR written in mdout, ERR written in standard output (nohup.out)
>> ----
>> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>> ----
>>
>> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before
>> crash)
>> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before
>> crash)
>>
>>
>>
>> On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand
>> <varelse2005.gmail.com> wrote:
>>
>>> Oops, meant to send that to Jason...
>>>
>>> Anyway, before we all panic, we need to get K20's behavior analyzed
>>> here.
>>> If it's deterministic, this truly is a hardware issue. If not, then it
>>> gets interesting because 680 is deterministic as far as I can tell...
>>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com>
>>> wrote:
>>>
>>>> If the errors are not deterministically triggered, they probably
>>>> won't be fixed by the patch, alas...
>>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com>
>>>> wrote:
>>>>
>>>>> Just a reminder to everyone based on what Ross said: there is a
>>>>> pending patch to pmemd.cuda that will be coming out shortly (maybe
>>>>> even within hours). It's entirely possible that several of these
>>>>> errors are fixed by this patch.
>>>>>
>>>>> All the best,
>>>>> Jason
>>>>>
>>>>>
>>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>>>>> wrote:
>>>>>
>>>>> > I have observed the same crashes from time to time. I will run
>>>>> > cellulose NVE for 100k and will post the results here.
>>>>> >
>>>>> > All the best,
>>>>> > Filip
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > ________________________________
>>>>> > From: Scott Le Grand <varelse2005.gmail.com>
>>>>> > To: AMBER Mailing List <amber.ambermd.org>
>>>>> > Sent: Thursday, May 30, 2013 9:01 PM
>>>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>>> > memtestG80 - UNDERclocking in Linux ?
>>>>> >
>>>>> >
>>>>> > Run cellulose NVE for 100k iterations twice. If the final
>>>>> > energies don't match, you have a hardware issue. No need to play
>>>>> > with ntpr or any other variable.
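[Editor's note: Scott's two-run check can be automated with a short
script. This is a minimal sketch, not part of the original thread; the
helper names are mine, and the only assumption about Amber is the
standard "Etot   =   ..." energy records that pmemd.cuda writes to
mdout.]

```python
import re

def final_etot(mdout_text):
    """Return the last Etot value found in an Amber mdout file's text.

    Assumes the standard energy-record layout, e.g.
    ' Etot   =   -443256.6711  EKtot   =  ...'.  A completed run also
    prints Etot in the trailing averages/fluctuations blocks; for a
    reproducibility check that is fine, since a healthy pair of runs
    must reproduce those records too.
    """
    matches = re.findall(r"Etot\s*=\s*(-?\d+\.\d+)", mdout_text)
    if not matches:
        raise ValueError("no Etot records found")
    return float(matches[-1])

def runs_match(mdout_a, mdout_b):
    # Compare exactly, not within a tolerance: two identical runs on
    # healthy hardware should agree bitwise in the printed energies.
    return final_etot(mdout_a) == final_etot(mdout_b)
```

Usage: run the same CELLULOSE_NVE input twice, read both mdout files,
and call runs_match on their contents; False suggests a hardware issue.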
>>>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>>>> >
>>>>> > >
>>>>> > > Dear all,
>>>>> > >
>>>>> > > I would also like to share one of my experiences with Titan
>>>>> > > cards. We have one GTX Titan card, and with one system (~55k
>>>>> > > atoms, NVT, RNA + waters) we ran into the same troubles you are
>>>>> > > describing. I was also playing with ntpr to figure out, step by
>>>>> > > step, what is going on. I understand that the code uses
>>>>> > > different routines for calculating energies+forces versus
>>>>> > > forces only. The simulations of other systems are perfectly
>>>>> > > stable, running for days and weeks. Only that particular system
>>>>> > > systematically ends up with this error.
>>>>> > >
>>>>> > > However, there was one interesting issue. When I set ntpr=1,
>>>>> > > the error vanished (systematically, in multiple runs) and the
>>>>> > > simulation was able to run for more than a million steps (I did
>>>>> > > not let it run for weeks, as in the meantime I shifted that
>>>>> > > simulation to another card: need data, not testing). All other
>>>>> > > settings of ntpr failed. As I read this discussion, I tried to
>>>>> > > set ene_avg_sampling=1 with some high value of ntpr (I expected
>>>>> > > this to shift the code to permanently use the forces+energies
>>>>> > > part of the code, similarly to ntpr=1), but the error occurred
>>>>> > > again.
>>>>> > >
>>>>> > > I know it is not very conclusive for finding out what is
>>>>> > > happening, at least not for me. Do you have any idea why ntpr=1
>>>>> > > might help?
>>>>> > >
>>>>> > > best regards,
>>>>> > >
>>>>> > > Pavel
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > --
>>>>> > > Pavel Banáš
>>>>> > > pavel.banas.upol.cz
>>>>> > > Department of Physical Chemistry,
>>>>> > > Palacky University Olomouc
>>>>> > > Czech Republic
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > ---------- Original message ----------
>>>>> > > From: Jason Swails <jason.swails.gmail.com>
>>>>> > > Date: 29. 5. 2013
>>>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN
>>>>> > > Superclocked - memtestG80 - UNDERclocking in Linux ?
>>>>> > >
>>>>> > > I'll answer a little bit:
>>>>> > >
>>>>> > > > NTPR=10 Etot after 2000 steps
>>>>> > > >
>>>>> > > > -443256.6711
>>>>> > > > -443256.6711
>>>>> > > >
>>>>> > > > NTPR=200 Etot after 2000 steps
>>>>> > > >
>>>>> > > > -443261.0705
>>>>> > > > -443261.0705
>>>>> > > >
>>>>> > > > Any idea why the energies should depend on the frequency of
>>>>> > > > energy records (NTPR)?
>>>>> > > >
>>>>> > >
>>>>> > > It is a subtle point, but the answer is 'different code
>>>>> > > paths.' In general, it is NEVER necessary to compute the actual
>>>>> > > energy of a molecule during the course of standard molecular
>>>>> > > dynamics (by analogy, it is NEVER necessary to compute atomic
>>>>> > > forces during the course of random Monte Carlo sampling).
>>>>> > >
>>>>> > > For performance's sake, then, pmemd.cuda computes only the
>>>>> > > forces when energies are not requested, leading to a different
>>>>> > > order of operations for those runs. This difference ultimately
>>>>> > > causes divergence.
>>>>> > >
>>>>> > > To test this, try setting the variable ene_avg_sampling=10 in
>>>>> > > the &cntrl section. This will force pmemd.cuda to compute
>>>>> > > energies every 10 steps (for energy averaging), which will in
>>>>> > > turn make the code path followed identical for any
>>>>> > > multiple-of-10 value of ntpr.
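[Editor's note: Jason's suggestion corresponds to an mdin fragment
along these lines. This sketch is not from the original thread, and
every setting other than ntpr and ene_avg_sampling is an illustrative
placeholder.]

```
 &cntrl
   imin = 0, nstlim = 100000, dt = 0.002,   ! placeholder run settings
   ntpr = 200,             ! write energies every 200 steps
   ene_avg_sampling = 10,  ! but compute them every 10 steps, so the
                           ! force+energy code path is taken
                           ! identically for any multiple-of-10 ntpr
 /
```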
>>>>> > >
>>>>> > > --
>>>>> > > Jason M. Swails
>>>>> > > Quantum Theory Project,
>>>>> > > University of Florida
>>>>> > > Ph.D. Candidate
>>>>> > > 352-392-4032
>>>>> > > _______________________________________________
>>>>> > > AMBER mailing list
>>>>> > > AMBER.ambermd.org
>>>>> > > http://lists.ambermd.org/mailman/listinfo/amber
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason M. Swails
>>>>> Quantum Theory Project,
>>>>> University of Florida
>>>>> Ph.D. Candidate
>>>>> 352-392-4032
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> This message was created by Opera's revolutionary e-mail client:
>> http://www.opera.com/mail/
>>
>>
>
>
>
>


-- 
This message was created by Opera's revolutionary e-mail client:
http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 30 2013 - 16:30:02 PDT