Hi,
This is what I obtained for the 50K-step tests on a "normal" GTX Titan:
run1:
------------------------------------------------------------------------------
A V E R A G E S O V E R 50 S T E P S
NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 299.87 PRESS = 0.0
Etot = -443237.1079 EKtot = 257679.9750 EPtot = -700917.0829
BOND = 20193.1856 ANGLE = 53517.5432 DIHED = 23575.4648
1-4 NB = 21759.5524 1-4 EEL = 742552.5939 VDWAALS = 96286.7714
EELEC = -1658802.1941 EHBOND = 0.0000 RESTRAINT = 0.0000
------------------------------------------------------------------------------
R M S F L U C T U A T I O N S
NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 0.33 PRESS = 0.0
Etot = 11.2784 EKtot = 284.8999 EPtot = 289.0773
BOND = 136.3417 ANGLE = 214.0054 DIHED = 59.4893
1-4 NB = 58.5891 1-4 EEL = 330.5400 VDWAALS = 559.2079
EELEC = 743.8771 EHBOND = 0.0000 RESTRAINT = 0.0000
|E(PBS) = 21.8119
------------------------------------------------------------------------------
run2:
------------------------------------------------------------------------------
A V E R A G E S O V E R 50 S T E P S
NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 299.89 PRESS = 0.0
Etot = -443240.0999 EKtot = 257700.0950 EPtot = -700940.1949
BOND = 20241.9174 ANGLE = 53644.6694 DIHED = 23541.3737
1-4 NB = 21803.1898 1-4 EEL = 742754.2254 VDWAALS = 96298.8308
EELEC = -1659224.4013 EHBOND = 0.0000 RESTRAINT = 0.0000
------------------------------------------------------------------------------
R M S F L U C T U A T I O N S
NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 0.41 PRESS = 0.0
Etot = 10.7633 EKtot = 348.2819 EPtot = 353.9918
BOND = 106.5314 ANGLE = 196.7052 DIHED = 69.7476
1-4 NB = 60.3435 1-4 EEL = 400.7466 VDWAALS = 462.7763
EELEC = 651.9857 EHBOND = 0.0000 RESTRAINT = 0.0000
|E(PBS) = 17.0642
------------------------------------------------------------------------------
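
A minimal sketch of how the two average blocks above could be compared automatically, assuming the mdout text keeps the "Etot = value" layout shown here. The helper names parse_energies and differing_terms are made up for illustration and are not part of Amber.

```python
import re

def parse_energies(block, keys=("Etot", "EKtot", "EPtot")):
    """Return {key: value} for the first occurrence of each key in an mdout text block."""
    values = {}
    for key in keys:
        match = re.search(rf"\b{re.escape(key)}\s*=\s*(-?\d+(?:\.\d+)?)", block)
        if match:
            values[key] = float(match.group(1))
    return values

def differing_terms(block_a, block_b, tol=0.0):
    """List the keys whose values differ by more than tol between two blocks."""
    a, b = parse_energies(block_a), parse_energies(block_b)
    return [k for k in a if k in b and abs(a[k] - b[k]) > tol]

# Average energies copied from run1 and run2 above.
run1 = "Etot = -443237.1079 EKtot = 257679.9750 EPtot = -700917.0829"
run2 = "Etot = -443240.0999 EKtot = 257700.0950 EPtot = -700940.1949"
print(differing_terms(run1, run2))
```

With tol=0.0 any difference in the printed digits is reported, which is the strict "identical output" criterion used in this thread; a looser tolerance would answer a different question.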
--------------------------------------------------------------------------------
________________________________
From: Marek Maly <marek.maly.ujep.cz>
To: AMBER Mailing List <amber.ambermd.org>
Sent: Friday, May 31, 2013 3:34 PM
Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux ?
Hi, here are my 100K-step results for driver 313.30 (still with CUDA 5.0).
The results are rather similar to those obtained
under my original driver 319.17 (see the first table
which I sent in this thread).
M.
On Fri, 31 May 2013 12:29:59 +0200, Marek Maly <marek.maly.ujep.cz>
wrote:
> Hi,
>
> please try to run at least the 100K tests twice to verify exact
> reproducibility of the results on the given card. If you find ig=-1 in
> any mdin file, just delete it to ensure that you are using the identical
> random seed for both runs. You can also omit the NUCLEOSOME test,
> as it is too time-consuming.
>
> Driver 310.44 ?????
>
> As far as I know, proper support for Titans starts with version 313.26
>
> see e.g. here :
> http://www.geeks3d.com/20130306/nvidia-releases-r313-26-for-linux-with-gtx-titan-support/
>
> BTW: On my side, downgrading to driver 313.30 did not solve the problem;
> I will post my results here soon.
>
> M.
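
A small sketch of the "delete ig=-1" step suggested above, assuming the mdin files use the usual "name=value," namelist layout. The function name strip_random_seed is hypothetical, and the sketch only edits text; it does not validate the resulting namelist.

```python
import re

# Matches an "ig = -1" assignment (any case, optional spaces), plus a trailing
# comma and spaces if present. The lookahead avoids touching values like -10.
_IG_MINUS_ONE = re.compile(r"\big\s*=\s*-1(?!\d)\s*,?[ \t]*", re.IGNORECASE)

def strip_random_seed(mdin_text):
    """Return mdin_text with any ig=-1 assignment removed."""
    return _IG_MINUS_ONE.sub("", mdin_text)

example = "  ntt=3, ig=-1, ntb=1,"
print(strip_random_seed(example))
```

Running both repeats from the same stripped input file keeps the seed identical between them, which is what the reproducibility check relies on.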
>
> On Fri, 31 May 2013 12:21:21 +0200, ET <sketchfoot.gmail.com> wrote:
>
>> ps. I have another install of amber on another computer with a different
>> Titan and different Driver Version: 310.44.
>>
>> In the interests of thrashing the proverbial horse, I'll run the
>> benchmark for 50k steps. :P
>>
>> br,
>> g
>>
>>
>> On 31 May 2013 11:17, ET <sketchfoot.gmail.com> wrote:
>>
>>> Hi, I just ran the Amber benchmark for the default (10000 steps) on my
>>> Titan.
>>>
>>> Using sdiff -sB showed that the two runs were completely identical.
>>> I've attached compressed files of the mdout & diff files.
>>>
>>> br,
>>> g
>>>
>>>
>>> On 30 May 2013 23:41, Marek Maly <marek.maly.ujep.cz> wrote:
>>>
>>>> OK, let's see. I see downclocking as the very last option
>>>> (if I don't decide on an RMA). But for now some other experiments
>>>> are still available :))
>>>> I have just started the 100K tests under the 313.30 driver. Good night
>>>> for today ...
>>>>
>>>> M.
>>>>
>>>> On Fri, 31 May 2013 00:45:49 +0200, Scott Le Grand
>>>> <varelse2005.gmail.com> wrote:
>>>>
>>>> > It will be very interesting if this behavior persists after
>>>> > downclocking.
>>>> >
>>>> > But right now, Titan 0 *looks* hosed and Titan 1 *looks* like it
>>>> > needs downclocking...
>>>> > On May 30, 2013 3:20 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>>>> >
>>>> >> Hi all,
>>>> >>
>>>> >> here are my results from the 500K-step benchmarks, each repeated
>>>> >> twice, under the 319.23 driver and still CUDA 5.0 (see the attached
>>>> >> table).
>>>> >>
>>>> >> It is hard to say whether the results are better or worse than in my
>>>> >> previous 100K test under driver 319.17.
>>>> >>
>>>> >> The Cellulose results improved: the TITAN_1 card even finished all
>>>> >> 500K steps successfully, and with exactly the same final energy!
>>>> >> (TITAN_0 at least finished more than 100K steps, and in RUN_01 even
>>>> >> more than 400K steps.) On the other hand, in the JAC_NPT test
>>>> >> neither GPU was able to finish even 100K steps, and the results from
>>>> >> the JAC_NVE test are not very convincing either. FACTOR_IX_NVE and
>>>> >> FACTOR_IX_NPT finished successfully, with 100% reproducibility in
>>>> >> the FACTOR_IX_NPT case (on both cards) and almost 100%
>>>> >> reproducibility in the FACTOR_IX_NVE case (again 100% on TITAN_1).
>>>> >> TRPCAGE and MYOGLOBIN again finished without any problem, with 100%
>>>> >> reproducibility. The NUCLEOSOME test was not run this time because
>>>> >> of its long run time. A positive number ending in K (which means
>>>> >> "thousands") in the table is the last step number written to mdout
>>>> >> before the crash. Below are all 3 types of detected errors, with the
>>>> >> systems/rounds in which each error appeared.
>>>> >>
>>>> >> Now I will try just the 100K tests under ET's favourite driver
>>>> >> version, 313.30 :)), and then I will possibly experiment with
>>>> >> CUDA 5.5, which I have already downloaded from the CUDA zone (I had
>>>> >> to become a CUDA developer for this :)) ). BTW ET, thanks for the
>>>> >> frequency info! I am still (perhaps not alone :)) ) very curious
>>>> >> about your twice-repeated Amber benchmark tests with the
>>>> >> superclocked Titan. And of course I am also very curious about
>>>> >> Ross's "hot" patch.
>>>> >>
>>>> >> M.
>>>> >>
>>>> >> ERRORS DETECTED DURING THE 500K steps tests with driver 319.23
>>>> >>
>>>> >> #1 ERR written in mdout:
>>>> >> ------
>>>> >> | ERROR: max pairlist cutoff must be less than unit cell max sphere radius!
>>>> >> ------
>>>> >>
>>>> >> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps successfully done before crash)
>>>> >> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps successfully done before crash)
>>>> >>
>>>> >>
>>>> >> #2 no ERR written in mdout, ERR written in standard output (nohup.out)
>>>> >>
>>>> >> ----
>>>> >> Error: unspecified launch failure launching kernel kNLSkinTest
>>>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>> >> ----
>>>> >>
>>>> >> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps successfully done before crash)
>>>> >> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps successfully done before crash)
>>>> >> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps successfully done before crash)
>>>> >> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps successfully done before crash)
>>>> >> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps successfully done before crash)
>>>> >>
>>>> >>
>>>> >> #3 no ERR written in mdout, ERR written in standard output (nohup.out)
>>>> >> ----
>>>> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>> >> ----
>>>> >>
>>>> >> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps successfully done before crash)
>>>> >> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps successfully done before crash)
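
The three error types listed above could be picked out of mdout or nohup.out text with a sketch like the following. The signature strings are copied from the messages quoted in this thread; the labels and the function name classify_errors are made up for illustration.

```python
# Substrings taken from the three error messages reported above.
ERROR_SIGNATURES = {
    "pairlist_cutoff": "max pairlist cutoff must be less than unit cell max sphere radius",
    "kernel_launch_failure": "unspecified launch failure launching kernel",
    "memcpy_download_failure": "cudaMemcpy GpuBuffer::Download failed",
}

def classify_errors(log_text):
    """Return the labels of all known error signatures found in log_text."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return [name for name, sig in ERROR_SIGNATURES.items() if sig in flat]

sample = (
    "Error: unspecified launch failure launching kernel kNLSkinTest\n"
    "cudaFree GpuBuffer::Deallocate failed unspecified launch failure\n"
)
print(classify_errors(sample))
```

An empty list only means none of these three signatures was seen; it says nothing about other failures.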
>>>> >>
>>>> >> On Thu, 30 May 2013 21:27:17 +0200, Scott Le Grand
>>>> >> <varelse2005.gmail.com> wrote:
>>>> >>
>>>> >>> Oops meant to send that to Jason...
>>>> >>>
>>>> >>> Anyway, before we all panic, we need to get K20's behavior analyzed
>>>> >>> here. If it's deterministic, this truly is a hardware issue. If not,
>>>> >>> then it gets interesting because 680 is deterministic as far as I can
>>>> >>> tell...
>>>> >>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com>
>>>> >>> wrote:
>>>> >>>
>>>> >>>> If the errors are not deterministically triggered, they probably
>>>> >>>> won't be fixed by the patch, alas...
>>>> >>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>>> Just a reminder to everyone based on what Ross said: there is a
>>>> >>>>> pending patch to pmemd.cuda that will be coming out shortly (maybe
>>>> >>>>> even within hours). It's entirely possible that several of these
>>>> >>>>> errors are fixed by this patch.
>>>> >>>>>
>>>> >>>>> All the best,
>>>> >>>>> Jason
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com>
>>>> >>>>> wrote:
>>>> >>>>>
>>>> >>>>> > I have observed the same crashes from time to time. I will run
>>>> >>>>> > cellulose nve for 100k and will post the results here.
>>>> >>>>> >
>>>> >>>>> > All the best,
>>>> >>>>> > Filip
>>>> >>>>> >
>>>> >>>>> >
>>>> >>>>> >
>>>> >>>>> >
>>>> >>>>> > ________________________________
>>>> >>>>> > From: Scott Le Grand <varelse2005.gmail.com>
>>>> >>>>> > To: AMBER Mailing List <amber.ambermd.org>
>>>> >>>>> > Sent: Thursday, May 30, 2013 9:01 PM
>>>> >>>>> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>> >>>>> > memtestG80 - UNDERclocking in Linux ?
>>>> >>>>> >
>>>> >>>>> >
>>>> >>>>> > Run cellulose nve for 100k iterations twice. If the final energies
>>>> >>>>> > don't match, you have a hardware issue. No need to play with ntpr
>>>> >>>>> > or any other variable.
>>>> >>>>> > On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>>> >>>>> >
>>>> >>>>> > >
>>>> >>>>> > > Dear all,
>>>> >>>>> > >
>>>> >>>>> > > I would also like to share my experience with Titan cards. We
>>>> >>>>> > > have one GTX Titan card, and with one system (~55k atoms, NVT,
>>>> >>>>> > > RNA+waters) we run into the same troubles you are describing. I
>>>> >>>>> > > was also playing with ntpr to figure out what is going on, step
>>>> >>>>> > > by step. I understand that the code uses different routines for
>>>> >>>>> > > calculating energies+forces versus forces only. The simulations
>>>> >>>>> > > of other systems are perfectly stable, running for days and
>>>> >>>>> > > weeks. Only that particular system systematically ends up with
>>>> >>>>> > > this error.
>>>> >>>>> > >
>>>> >>>>> > > However, there was one interesting issue. When I set ntpr=1, the
>>>> >>>>> > > error vanished (systematically, in multiple runs) and the
>>>> >>>>> > > simulation was able to run for millions of steps (I did not let
>>>> >>>>> > > it run for weeks, as in the meantime I moved that simulation to
>>>> >>>>> > > another card: I need data, not testing). All other settings of
>>>> >>>>> > > ntpr failed. After reading this discussion, I tried setting
>>>> >>>>> > > ene_avg_sampling=1 with some high value of ntpr (I expected that
>>>> >>>>> > > this would make the code permanently use the force+energies
>>>> >>>>> > > path, similarly to ntpr=1), but the error occurred again.
>>>> >>>>> > >
>>>> >>>>> > > I know this is not very conclusive for finding out what is
>>>> >>>>> > > happening, at least not for me. Do you have any idea why ntpr=1
>>>> >>>>> > > might help?
>>>> >>>>> > >
>>>> >>>>> > > best regards,
>>>> >>>>> > >
>>>> >>>>> > > Pavel
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > > --
>>>> >>>>> > > Pavel Banáš
>>>> >>>>> > > pavel.banas.upol.cz
>>>> >>>>> > > Department of Physical Chemistry,
>>>> >>>>> > > Palacky University Olomouc
>>>> >>>>> > > Czech Republic
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > >
>>>> >>>>> > > ---------- Original message ----------
>>>> >>>>> > > From: Jason Swails <jason.swails.gmail.com>
>>>> >>>>> > > Date: 29 May 2013
>>>> >>>>> > > Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>> >>>>> > > memtestG80 - UNDERclocking in Linux ?
>>>> >>>>> > >
>>>> >>>>> > > "I'll answer a little bit:
>>>> >>>>> > >
>>>> >>>>> > > NTPR=10 Etot after 2000 steps
>>>> >>>>> > > >
>>>> >>>>> > > > -443256.6711
>>>> >>>>> > > > -443256.6711
>>>> >>>>> > > >
>>>> >>>>> > > > NTPR=200 Etot after 2000 steps
>>>> >>>>> > > >
>>>> >>>>> > > > -443261.0705
>>>> >>>>> > > > -443261.0705
>>>> >>>>> > > >
>>>> >>>>> > > > Any idea why energies should depend on the frequency of energy
>>>> >>>>> > > > records (NTPR)?
>>>> >>>>> > > >
>>>> >>>>> > >
>>>> >>>>> > > It is a subtle point, but the answer is 'different code paths.'
>>>> >>>>> > > In general, it is NEVER necessary to compute the actual energy of
>>>> >>>>> > > a molecule during the course of standard molecular dynamics (by
>>>> >>>>> > > analogy, it is NEVER necessary to compute atomic forces during
>>>> >>>>> > > the course of random Monte Carlo sampling).
>>>> >>>>> > >
>>>> >>>>> > > For performance's sake, then, pmemd.cuda computes only the force
>>>> >>>>> > > when energies are not requested, leading to a different order of
>>>> >>>>> > > operations for those runs. This difference ultimately causes
>>>> >>>>> > > divergence.
>>>> >>>>> > >
>>>> >>>>> > > To test this, try setting the variable ene_avg_sampling=10 in the
>>>> >>>>> > > &cntrl section. This will force pmemd.cuda to compute energies
>>>> >>>>> > > every 10 steps (for energy averaging), which will in turn make
>>>> >>>>> > > the followed code path identical for any multiple-of-10 value of
>>>> >>>>> > > ntpr.
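
The argument above can be illustrated with a toy model of which steps take the "energies + forces" path. This is only a sketch of the reasoning, not pmemd.cuda's actual logic: the function energy_steps is hypothetical, and passing ene_avg_sampling=None here simply stands for "no extra energy sampling beyond ntpr".

```python
def energy_steps(nstlim, ntpr, ene_avg_sampling=None):
    """Steps (1..nstlim) on which this toy model evaluates energies as well as forces."""
    intervals = [ntpr] if ene_avg_sampling is None else [ntpr, ene_avg_sampling]
    return {step for step in range(1, nstlim + 1)
            if any(step % n == 0 for n in intervals)}

# With energy sampling every 10 steps, ntpr=10 and ntpr=200 share one schedule.
same = energy_steps(2000, 10, 10) == energy_steps(2000, 200, 10)
# Without it, the schedules (and hence the order of operations) differ.
different = energy_steps(2000, 10) != energy_steps(2000, 200)
print(same, different)
```

In this model the two ntpr values only give the same sequence of code paths once the energy-sampling interval divides both of them, which mirrors the "multiple-of-10" condition stated above.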
>>>> >>>>> > >
>>>> >>>>> > > --
>>>> >>>>> > > Jason M. Swails
>>>> >>>>> > > Quantum Theory Project,
>>>> >>>>> > > University of Florida
>>>> >>>>> > > Ph.D. Candidate
>>>> >>>>> > > 352-392-4032
>>>> >>>>> > > _______________________________________________
>>>> >>>>> > > AMBER mailing list
>>>> >>>>> > > AMBER.ambermd.org
>>>> >>>>> > > http://lists.ambermd.org/mailman/listinfo/amber
>>>> >>>>> > > "
>>>> >>>>> >
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Jason M. Swails
>>>> >>>>> Quantum Theory Project,
>>>> >>>>> University of Florida
>>>> >>>>> Ph.D. Candidate
>>>> >>>>> 352-392-4032
>>>> >>>
>>>> >>> __________ Information from ESET NOD32 Antivirus, database version 8394
>>>> >>> (20130530) __________
>>>> >>>
>>>> >>> This message was checked by ESET NOD32 Antivirus.
>>>> >>>
>>>> >>> http://www.eset.cz
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >> --
>>>> >> This message was created with Opera's revolutionary e-mail client:
>>>> >> http://www.opera.com/mail/
>>>>
>>>>
>>>>
>>>
>>>
>
>
Received on Fri May 31 2013 - 09:00:02 PDT