Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?

From: Marek Maly <marek.maly.ujep.cz>
Date: Fri, 31 May 2013 00:00:21 +0200

Hi all,

here are my results from the 500K-step benchmarks, each repeated twice,
under driver 319.23 and still CUDA 5.0 (see the attached table).

It is hard to say if the results are better or worse than in my
previous 100K test under driver 319.17.

The results of the Cellulose test improved: the TITAN_1 card even
finished all 500K steps in both rounds, moreover with exactly the same
final energy!
(TITAN_0 at least got past 100K steps in both rounds, and in RUN_01 even
past 400K steps.)
In the JAC_NPT test, however, no GPU was able to reach even 100K steps,
and the JAC_NVE results are not very convincing either. FACTOR_IX_NVE and
FACTOR_IX_NPT finished successfully, with 100% reproducibility for
FACTOR_IX_NPT (on both cards) and almost 100% reproducibility for
FACTOR_IX_NVE (again 100% on TITAN_1). TRPCAGE and MYOGLOBIN again
finished without any problem and with 100% reproducibility. The
NUCLEOSOME test was not run this time because it takes too long. A
positive number ending in K ("thousands") in the table is the last step
number written to mdout before the crash.
Below are the 3 types of detected errors, each with the systems/rounds
in which it appeared.

Now I will try just the 100K tests under ET's favourite driver version
313.30 :)) and then I will possibly experiment with CUDA 5.5, which I have
already downloaded from the CUDA Zone (I had to become a CUDA developer
for this :)) ). BTW ET, thanks for the frequency info!
I am still (perhaps not alone :)) ) very curious about your 2 x repeated
Amber benchmark tests with the superclocked Titan. Of course, I am also
very curious about that "hot" patch from Ross.

   M.

ERRORS DETECTED DURING THE 500K-step tests with driver 319.23

#1 ERR written in mdout:
------
| ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
------

TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before crash)
TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before crash)


#2 no ERR written in mdout, ERR written in standard output (nohup.out)

----
Error: unspecified launch failure launching kernel kNLSkinTest
cudaFree GpuBuffer::Deallocate failed unspecified launch failure
----

TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done before crash)
TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done before crash)
TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done before crash)
TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done before crash)
TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done before crash)


#3 no ERR written in mdout, ERR written in standard output (nohup.out)

----
cudaMemcpy GpuBuffer::Download failed unspecified launch failure
----

TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before crash)
TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before crash)
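
For anyone who wants to sort crashed runs into these three groups
automatically, here is a minimal sketch in Python. It is only an
illustration and not part of Amber: the function name classify_crash and
the numbering 1/2/3 are hypothetical and simply follow the list above, and
the matched substrings are copied from the messages quoted there. It
assumes you pass in the text of mdout and of the captured standard output
(nohup.out).

```python
# Substrings copied from the three error reports listed above.
# The numbers (1, 2, 3) follow this message; they are not an Amber convention.
ERROR_SIGNATURES = [
    (1, "max pairlist cutoff must be less than unit cell max sphere"),
    (2, "launching kernel kNLSkinTest"),
    (3, "GpuBuffer::Download failed"),
]


def classify_crash(mdout_text, stdout_text):
    """Return 1, 2 or 3 for the error types above, or None if none matches."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    combined = " ".join((mdout_text + "\n" + stdout_text).split())
    for number, signature in ERROR_SIGNATURES:
        if signature in combined:
            return number
    return None
```

A run that returns None either finished cleanly or failed in some way not
seen in these tests.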

On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand <varelse2005.gmail.com>
wrote:
> Oops meant to send that to Jason...
>
> Anyway, before we all panic, we need to get K20's behavior analyzed here.
> If it's deterministic, this truly is a hardware issue.  If not, then it
> gets interesting because 680 is deterministic as far as I can tell...
> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>
>> If the errors are not deterministically triggered, they probably won't
>> be fixed by the patch, alas...
>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>
>>> Just a reminder to everyone based on what Ross said: there is a pending
>>> patch to pmemd.cuda that will be coming out shortly (maybe even within
>>> hours).  It's entirely possible that several of these errors are fixed
>>> by this patch.
>>>
>>> All the best,
>>> Jason
>>>
>>>
>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>>> wrote:
>>>
>>> > I have observed the same crashes from time to time. I will run
>>> > cellulose nve for 100k and will paste the results here.
>>> >
>>> > All the best,
>>> > Filip
>>> >
>>> >
>>> >
>>> >
>>> > ________________________________
>>> >  From: Scott Le Grand <varelse2005.gmail.com>
>>> > To: AMBER Mailing List <amber.ambermd.org>
>>> > Sent: Thursday, May 30, 2013 9:01 PM
>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>> > memtestG80 - UNDERclocking in Linux ?
>>> >
>>> >
>>> > Run cellulose nve for 100k iterations twice.  If the final energies
>>> > don't match, you have a hardware issue.  No need to play with ntpr or
>>> > any other variable.
>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>> >
>>> > >
>>> > > Dear all,
>>> > >
>>> > > I would also like to share one of my experiences with Titan cards.
>>> > > We have one GTX Titan card, and with one system (~55k atoms, NVT,
>>> > > RNA+waters) we run into the same troubles you are describing. I was
>>> > > also playing with ntpr to figure out what is going on, step by step.
>>> > > I understand that the code uses different routines for calculating
>>> > > energies+forces or only forces. The simulations of other systems are
>>> > > perfectly stable, running for days and weeks. Only that particular
>>> > > system systematically ends up with this error.
>>> > >
>>> > > However, there was one interesting issue. When I set ntpr=1, the
>>> > > error vanished (systematically, in multiple runs) and the simulation
>>> > > was able to run for millions of steps (I did not let it run for
>>> > > weeks, as in the meantime I moved that simulation to another card -
>>> > > I need data, not testing). All other settings of ntpr failed. As I
>>> > > read this discussion, I tried setting ene_avg_sampling=1 with some
>>> > > high value of ntpr (I expected that this would make the code
>>> > > permanently use the force+energies path, similarly to ntpr=1), but
>>> > > the error occurred again.
>>> > >
>>> > > I know it is not very conclusive for finding out what is happening,
>>> > > at least not for me. Do you have any idea why ntpr=1 might help?
>>> > >
>>> > > best regards,
>>> > >
>>> > > Pavel
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Pavel Banáš
>>> > > pavel.banas.upol.cz
>>> > > Department of Physical Chemistry,
>>> > > Palacky University Olomouc
>>> > > Czech Republic
>>> > >
>>> > >
>>> > >
>>> > > ---------- Original message ----------
>>> > > From: Jason Swails <jason.swails.gmail.com>
>>> > > Date: 29 May 2013
>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>> > > memtestG80 - UNDERclocking in Linux ?
>>> > >
>>> > > "I'll answer a little bit:
>>> > >
>>> > > > NTPR=10 Etot after 2000 steps
>>> > > >
>>> > > > -443256.6711
>>> > > > -443256.6711
>>> > > >
>>> > > > NTPR=200 Etot after 2000 steps
>>> > > >
>>> > > > -443261.0705
>>> > > > -443261.0705
>>> > > >
>>> > > > Any idea why energies should depend on frequency of energy records
>>> > > > (NTPR)?
>>> > > >
>>> > >
>>> > > It is a subtle point, but the answer is 'different code paths.' In
>>> > > general, it is NEVER necessary to compute the actual energy of a
>>> > > molecule during the course of standard molecular dynamics (by
>>> > > analogy, it is NEVER necessary to compute atomic forces during the
>>> > > course of random Monte Carlo sampling).
>>> > >
>>> > > For performance's sake, then, pmemd.cuda computes only the force
>>> > > when energies are not requested, leading to a different order of
>>> > > operations for those runs. This difference ultimately causes
>>> > > divergence.
>>> > >
>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
>>> > > &cntrl section. This will force pmemd.cuda to compute energies every
>>> > > 10 steps (for energy averaging), which will in turn make the
>>> > > followed code path identical for any multiple-of-10 value of ntpr.
>>> > >
>>> > > --
>>> > > Jason M. Swails
>>> > > Quantum Theory Project,
>>> > > University of Florida
>>> > > Ph.D. Candidate
>>> > > 352-392-4032
>>> > > _______________________________________________
>>> > > AMBER mailing list
>>> > > AMBER.ambermd.org
>>> > > http://lists.ambermd.org/mailman/listinfo/amber"
>>> >
>>>
>>>
>>>
>>> --
>>> Jason M. Swails
>>> Quantum Theory Project,
>>> University of Florida
>>> Ph.D. Candidate
>>> 352-392-4032
>>>
>>



Total energy at step 500 000 (driver 319.23)

          JAC_NVE      JAC_NPT  FACTOR_IX_NVE  FACTOR_IX_NPT  CELLULOSE_NVE  TRPCAGE    MYOGLOBIN
*TITAN_0
ROUND_1   -58138.7320  5K       -234184.3130   -234426.3859   437K           -229.9465  -1378.5278
ROUND_2   162K         8K       -234182.4210   -234426.3859   117K           -229.9465  -1378.5278
*TITAN_1
ROUND_1   119K         77K      -234182.4210   -234426.3859   -443260.9931   -229.9465  -1378.5278
ROUND_2   43K          58K      -234182.4210   -234426.3859   -443260.9931   -229.9465  -1378.5278



CELLULOSE_NVE total energy at step 100K (taken from the same 500K runs)

          CELLULOSE_NVE
*TITAN_0
ROUND_1   -443246.3206
ROUND_2   -443248.7307
*TITAN_1
ROUND_1   -443246.3206
ROUND_2   -443246.3206
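
The comparison behind this second table (same card, same input, same step:
do the printed energies match exactly?) can be sketched in Python as
follows. This is a rough illustration under stated assumptions, not an
Amber tool: the function names etot_at_step and reproducible are
hypothetical, and the code assumes that each energy record in mdout has a
line containing "NSTEP = <integer>" followed by a line containing
"Etot = <number>", and that the summary blocks at the end of the file
start with a line containing "A V E R A G E S". Please check this against
your own mdout files before relying on it.

```python
import re

# Assumed mdout layout: an "NSTEP = <int>" line starts each energy record,
# and a following line carries "Etot = <float>".
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")


def etot_at_step(mdout_text, step):
    """Return Etot (as printed) for the given step, or None if it is absent."""
    current_step = None
    for line in mdout_text.splitlines():
        if "A V E R A G E S" in line:
            break  # summary blocks reuse the Etot label; stop before them
        match = NSTEP_RE.search(line)
        if match:
            current_step = int(match.group(1))
            continue
        match = ETOT_RE.search(line)
        if match and current_step == step:
            return match.group(1)
    return None


def reproducible(mdout_a, mdout_b, step):
    """True only if both runs reached `step` and printed identical Etot."""
    etot_a = etot_at_step(mdout_a, step)
    etot_b = etot_at_step(mdout_b, step)
    return etot_a is not None and etot_a == etot_b
```

The energies are compared as printed strings, so two runs count as
reproducible only when they agree in every printed digit. A run that
crashed before the requested step gives None and is therefore reported as
not reproducible.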









_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 30 2013 - 15:30:02 PDT