And bingo...
At the very least, the reciprocal sum is intermittently inconsistent...
This explains the irreproducible behavior...
And here's the level of inconsistency:
31989.38940628897399 vs
31989.39168370794505
That's an error at the level of 1e-7, or a somehow missed single-precision
transaction somewhere...
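For scale, a quick back-of-the-envelope check of the two sums quoted above (the values are copied verbatim from this thread; the awk one-liner is just illustrative arithmetic, not part of pmemd):

```shell
# Relative difference between the two reciprocal-sum printouts.
awk 'BEGIN {
  a = 31989.38940628897399
  b = 31989.39168370794505
  d = (a > b) ? a - b : b - a
  printf "absolute: %.6e  relative: %.2e\n", d, d / a
}'
# relative comes out around 7e-08, i.e. just under single-precision
# machine epsilon (~1.2e-07) -- consistent with one float-sized slip
```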
The next question is figuring out why... This may or may not ultimately
explain the crashes you guys are also seeing...
On Sun, Jun 2, 2013 at 9:07 AM, Scott Le Grand <varelse2005.gmail.com> wrote:
>
> Observations:
> 1. The degree to which the reproducibility is broken *does* appear to vary
> between individual Titan GPUs. One of my Titans breaks within 10K steps on
> cellulose, the other one made it to 100K steps twice without doing so,
> leading me to believe it could be trusted (until yesterday, when I started
> seeing it die between 50K and 100K steps most of the time).
>
> 2. GB hasn't broken (yet). So could you run myoglobin for 500K and
> TRPcage for 1,000,000 steps and let's see if that's universal.
>
> 3. Turning on double-precision mode makes my Titan crash rather than run
> irreproducibly, sigh...
>
> So whatever is going on is triggered by something in PME but not GB. So
> that's either the radix sort, the FFT, the Ewald grid interpolation, or the
> neighbor list code. Fixing this involves isolating this and figuring out
> what exactly goes haywire. It could *still* be software at some very small
> probability but the combination of both 680 and K20c with ECC off running
> reliably is really pointing towards the Titans just being clocked too
> fast.
>
> So how long will this take? Asking people how long it takes to fix a bug
> never really works out well. That said, I found the 480 bug within a week
> and my usual turnaround for a bug with a solid repro is <24 hours.
>
> Scott
>
> On Sun, Jun 2, 2013 at 7:58 AM, Marek Maly <marek.maly.ujep.cz> wrote:
>
>> Hi all,
>>
>> here are my results after bugfix 18 application (see attachment).
>>
>> In principle I don't see any "drastic" changes.
>>
>> FACTOR_IX still perfectly stable/reproducible on both cards,
>>
>> JAC tests - problems with finishing and/or reproducibility; the
>> same for CELLULOSE_NVE, although here it seems that my TITAN_1
>> has no problems with this test (but I saw the same trend also
>> before bugfix 18 - see my older 500K steps test).
>>
>> But anyway bugfix 18 brought here one change.
>>
>> The error
>>
>>
>> #1 ERR written in mdout:
>> ------
>> | ERROR: max pairlist cutoff must be less than unit cell max sphere
>> radius!
>> ------
>>
>> was substituted with this err/warning ?
>>
>> #0 no ERR written in mdout, ERR written in standard output (nohup.out)
>> -----
>> Nonbond cells need to be recalculated, restart simulation from previous
>> checkpoint
>> with a higher value for skinnb.
>>
>> -----
>>
>> Another thing,
>>
>> recently I started, on another machine with a GTX 580 GPU, a simulation of a
>> relatively big system ( 364275 atoms/PME ). The system also contains
>> "exotic" molecules like polymers; the ff12SB, gaff and GLYCAM force fields
>> are used here. I had a problem even with the minimization part, with a huge
>> energy at the start:
>>
>> -----
>>    NSTEP       ENERGY          RMS            GMAX         NAME    NUMBER
>>        1       2.8442E+09     2.1339E+02     1.7311E+04     O       32998
>>
>> BOND    =     11051.7467  ANGLE   =     17720.4706  DIHED     =  18977.7584
>> VDWAALS = *************  EEL     =  -1257709.6203  HBOND     =      0.0000
>> 1-4 VDW =      7253.7412  1-4 EEL =    149867.0207  RESTRAINT =      0.0000
>>
>> ----
>>
>> with no chance to minimize the system even with 50 000 steps in both
>> min cycles (with constrained and unconstrained solute), and hence the NVT
>> heating crashed immediately even with a very small dt. I patched Amber12
>> here with bugfix 18 and the minimization then completed without any problem
>> with the common 5000 steps (reaching a final energy of -1.4505E+06 from the
>> initial value written above).
>>
>> So indeed bugfix 18 solved some issues, but unfortunately not those
>> related to
>> Titans.
>>
>> Here I will try to install cuda 5.5, recompile the GPU part of Amber with
>> this new cuda version, and repeat the 100K tests.
>>
>> Scott, let us know how your experiment with downclocking the Titan finished.
>> Maybe the best choice here would be to flash the Titan directly with your
>> K20c bios :))
>>
>> M.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sat, 01 Jun 2013 21:09:46 +0200 Marek Maly <marek.maly.ujep.cz> wrote:
>>
>>
>> Hi,
>>>
>>> first of all thanks for providing of your test results !
>>>
>>> It seems that your results are more or less similar to
>>> mine, maybe with the exception of the results on the FactorIX tests,
>>> where I had perfect stability and 100% (or close to 100%) reproducibility.
>>>
>>> Anyway, the types of errors you reported are the same ones I obtained.
>>>
>>> So let's see if bugfix 18 will help here (or at least on the NPT tests)
>>> or not. As I wrote a few minutes ago, it seems that it has still not been
>>> uploaded to the given server, although its description is already present
>>> on the given web page ( see http://ambermd.org/bugfixes12.html ).
>>>
>>> As you can see, this bugfix also contains changes in the CPU code, although
>>> the majority is devoted to the GPU code, so it is probably best to
>>> recompile the whole of Amber with this patch. The patch could perhaps be
>>> applied even after just the GPU configure command ( i.e. ./configure -cuda
>>> -noX11 gnu ), but after the subsequent build only the GPU binaries would
>>> be updated. Anyway, I would rather recompile the whole of Amber after this
>>> patch.
>>>
>>> Regarding GPU testing under Linux, you may try memtestG80
>>> (please use the updated/patched version from here:
>>> https://github.com/ihaque/memtestG80 )
>>>
>>> just use git command like:
>>>
>>> git clone https://github.com/ihaque/memtestG80.git PATCHED_MEMTEST-G80
>>>
>>> to download all the files and save them into directory named
>>> PATCHED_MEMTEST-G80.
>>>
>>> another possibility is to try a perhaps similar (but maybe more up to
>>> date) test, cuda_memtest ( http://sourceforge.net/projects/cudagpumemtest/ ).
>>>
>>> regarding the ig value: if ig is not present in mdin, the default value
>>> (71277) is used; if ig=-1 the random seed will be based on the current date
>>> and time, and hence will be different for every run (not a good variant
>>> for our tests). I simply deleted any ig records from all mdins, so I
>>> assume that in each run the default seed 71277 was automatically used.
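A minimal way to do that deletion in bulk, assuming each ig setting sits on its own line of the &cntrl namelist (the `*.mdin` glob is a hypothetical stand-in; adjust to your actual input file names):

```shell
# Strip any "ig = ..." line from every mdin file in the current
# directory, so pmemd falls back to the default seed (71277).
# Assumes ig is on its own line; GNU sed -i edits files in place.
sed -i '/^[[:space:]]*ig[[:space:]]*=/d' ./*.mdin
```

With GNU sed, `-i.bak` keeps a backup of each file; you can check the result afterwards with `grep -n '^ *ig *=' *.mdin`.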
>>>
>>> M.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, 01 Jun 2013 20:26:16 +0200 ET <sketchfoot.gmail.com> wrote:
>>>
>>> Hi,
>>>>
>>>> I've put the graphics card into a machine with the working GTX titan
>>>> that I
>>>> mentioned earlier.
>>>>
>>>> The Nvidia driver version is: 313.30
>>>>
>>>> Amber version is:
>>>> AmberTools version 13.03
>>>> Amber version 12.16
>>>>
>>>> I ran 50k steps with the amber benchmark using ig=43689 on both cards.
>>>> For
>>>> the purpose of discriminating between them, the card I believe (fingers
>>>> crossed) is working is called GPU-00_TeaNCake, whilst the other one is
>>>> called GPU-01_008.
>>>>
>>>> *When I run the tests on GPU-01_008:*
>>>>
>>>> 1) All the tests (across 2x repeats) finish apart from the following
>>>> which
>>>> have the errors listed:
>>>>
>>>> --------------------------------------------
>>>> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>>> Error: unspecified launch failure launching kernel kNLSkinTest
>>>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>
>>>> --------------------------------------------
>>>> CELLULOSE_PRODUCTION_NPT - 408,609 atoms PME
>>>> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>
>>>> --------------------------------------------
>>>> CELLULOSE_PRODUCTION_NVE - 408,609 atoms PME
>>>> Error: unspecified launch failure launching kernel kNLSkinTest
>>>> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>
>>>> --------------------------------------------
>>>> CELLULOSE_PRODUCTION_NPT - 408,609 atoms PME
>>>> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>> grep: mdinfo.1GTX680: No such file or directory
>>>>
>>>>
>>>>
>>>> 2) The sdiff logs indicate that reproducibility across the two repeats
>>>> is
>>>> as follows:
>>>>
>>>> *GB_myoglobin: *Reproducible across 50k steps
>>>> *GB_nucleosome:* Reproducible till step 7400
>>>> *GB_TRPCage:* Reproducible across 50k steps
>>>>
>>>> *PME_JAC_production_NVE: *No reproducibility shown from step 1,000
>>>> onwards
>>>> *PME_JAC_production_NPT*: Reproducible till step 1,000. Also outfile is
>>>> not written properly - blank gaps appear where something should have
>>>> been
>>>> written
>>>>
>>>> *PME_FactorIX_production_NVE:* Reproducible across 50k steps
>>>> *PME_FactorIX_production_NPT:* Reproducible across 50k steps
>>>>
>>>> *PME_Cellulose_production_NVE:* Failure means that both runs do not
>>>> finish (see point 1)
>>>> *PME_Cellulose_production_NPT:* Failure means that both runs do not
>>>> finish (see point 1)
>>>>
>>>> ############################################################
>>>>
>>>> *When I run the tests on GPU-00_TeaNCake:*
>>>> 1) All the tests (across 2x repeats) finish apart from the following
>>>> which
>>>> have the errors listed:
>>>> -------------------------------------
>>>> JAC_PRODUCTION_NPT - 23,558 atoms PME
>>>> PMEMD Terminated Abnormally!
>>>> -------------------------------------
>>>>
>>>>
>>>> 2) The sdiff logs indicate that reproducibility across the two repeats
>>>> is
>>>> as follows:
>>>>
>>>> *GB_myoglobin:* Reproducible across 50k steps
>>>> *GB_nucleosome:* Reproducible across 50k steps
>>>> *GB_TRPCage:* Reproducible across 50k steps
>>>>
>>>> *PME_JAC_production_NVE:* No reproducibility shown from step 10,000
>>>> onwards
>>>> *PME_JAC_production_NPT: * No reproducibility shown from step 10,000
>>>> onwards. Also outfile is not written properly - blank gaps appear where
>>>> something should have been written. Repeat 2 Crashes with error noted in
>>>> 1.
>>>>
>>>> *PME_FactorIX_production_NVE:* No reproducibility shown from step 9,000
>>>> onwards
>>>> *PME_FactorIX_production_NPT: *Reproducible across 50k steps
>>>>
>>>> *PME_Cellulose_production_NVE: *No reproducibility shown from step 5,000
>>>> onwards
>>>> *PME_Cellulose_production_NPT:* No reproducibility shown from step
>>>> 29,000 onwards. Also outfile is not written properly - blank gaps appear
>>>> where something should have been written.
>>>>
>>>>
>>>> Out files and sdiff files are included as attachments
>>>>
>>>> ################################################
>>>>
>>>> So I'm going to update my nvidia driver to the latest version and patch
>>>> amber to the latest version and rerun the tests to see if there is any
>>>> improvement. Could someone let me know if it is necessary to recompile
>>>> any
>>>> or all of AMBER after applying the bugfixes?
>>>>
>>>> Additionally, I'm going to run memory tests and heaven benchmarks on the
>>>> cards to check whether they are faulty or not.
>>>>
>>>> I'm thinking that there is a mix of hardware error/configuration (esp in
>>>> the case of GPU-01_008) and amber software error in this situation. What
>>>> do
>>>> you guys think?
>>>>
>>>> Also am I right in thinking (from what Scott was saying) that all the
>>>> benchmarks should be reproducible across 50k steps but begin to diverge
>>>> at
>>>> around 100K steps? Is there any difference between setting *ig* to an
>>>> explicit number and removing it from the mdin file?
>>>>
>>>> br,
>>>> g
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 31 May 2013 23:45, ET <sketchfoot.gmail.com> wrote:
>>>>
>>>> I don't need sysadmins, but sysadmins need me as it gives purpose to
>>>>> their
>>>>> bureaucratic existence. An evil encountered when working in an
>>>>> institution or company IMO. Good science and individuality being
>>>>> sacrificed for standardisation and mediocrity in the interests of
>>>>> maintaining a system that focusses on maintaining the system and not
>>>>> the objective.
>>>>>
>>>>> You need root to move fwd on these things, unfortunately. And ppl with
>>>>> root are kinda like your parents when you try to borrow money from them
>>>>> at age 12 :D
>>>>> On May 31, 2013 9:34 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>>>>>
>>>>> Sorry why do you need sysadmins :)) ?
>>>>>>
>>>>>> BTW here is the most recent driver:
>>>>>>
>>>>>> http://www.nvidia.com/object/linux-display-amd64-319.23-driver.html
>>>>>>
>>>>>> I do not remember anything easier than installing a driver (especially
>>>>>> in the case of the binary (*.run) installer) :))
>>>>>>
>>>>>> M.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dne Fri, 31 May 2013 22:02:34 +0200 ET <sketchfoot.gmail.com>
>>>>>> napsal/-a:
>>>>>>
>>>>>> > Yup. I know. I replaced a 680 and the all-knowing sysadmins are
>>>>>> > reluctant to install drivers not in the repository, as they are
>>>>>> > lame. :(
>>>>>> > On May 31, 2013 7:14 PM, "Marek Maly" <marek.maly.ujep.cz> wrote:
>>>>>> >>
>>>>>> >> As I already wrote you,
>>>>>> >>
>>>>>> >> the first driver which properly/officially supports Titans, should
>>>>>> be
>>>>>> >> 313.26 .
>>>>>> >>
>>>>>> >> Anyway I am curious mainly about your 100K repetitive tests with
>>>>>> >> your Titan SC card. Especially in case of these tests ( JAC_NVE,
>>>>>> JAC_NPT
>>>>>> >> and CELLULOSE_NVE ) where
>>>>>> >> my Titans SC randomly failed or succeeded. In FACTOR_IX_NVE,
>>>>>> >> FACTOR_IX_NPT
>>>>>> >> tests both
>>>>>> >> my cards are perfectly stable (independently from drv. version) and
>>>>>> also
>>>>>> >> the runs
>>>>>> >> are perfectly or almost perfectly reproducible.
>>>>>> >>
>>>>>> >> Also, if your test crashes, please report any errors.
>>>>>> >>
>>>>>> >> Up to this moment I have this library of errors on my Titan SC
>>>>>> >> GPUs.
>>>>>> >>
>>>>>> >> #1 ERR written in mdout:
>>>>>> >> ------
>>>>>> >> | ERROR: max pairlist cutoff must be less than unit cell max
>>>>>> sphere
>>>>>> >> radius!
>>>>>> >> ------
>>>>>> >>
>>>>>> >>
>>>>>> >> #2 no ERR written in mdout, ERR written in standard output
>>>>>> (nohup.out)
>>>>>> >>
>>>>>> >> ----
>>>>>> >> Error: unspecified launch failure launching kernel kNLSkinTest
>>>>>> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>>> >> ----
>>>>>> >>
>>>>>> >>
>>>>>> >> #3 no ERR written in mdout, ERR written in standard output
>>>>>> (nohup.out)
>>>>>> >> ----
>>>>>> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>>> >> ----
>>>>>> >>
>>>>>> >> Another question regarding your Titan SC: is it also EVGA, as in my
>>>>>> >> case, or is it from another producer ?
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >>
>>>>>> >> M.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Fri, 31 May 2013 19:17:03 +0200 ET <sketchfoot.gmail.com> wrote:
>>>>>> >>
>>>>>> >> > Well, this is interesting...
>>>>>> >> >
>>>>>> >> > I ran 50k steps on the Titan on the other machine with driver
>>>>>> 310.44
>>>>>> >> and
>>>>>> >> > it
>>>>>> >> > passed all the GB steps. i.e totally identical results over two
>>>>>> >> repeats.
>>>>>> >> > However, it failed all the PME tests after step 1000. I'm going
>>>>>> to
>>>>>> > update
>>>>>> >> > the driver and test it again.
>>>>>> >> >
>>>>>> >> > Files included as attachments.
>>>>>> >> >
>>>>>> >> > br,
>>>>>> >> > g
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On 31 May 2013 16:40, Marek Maly <marek.maly.ujep.cz> wrote:
>>>>>> >> >
>>>>>> >> >> One more thing,
>>>>>> >> >>
>>>>>> >> >> can you please check at which frequency your Titan is running ?
>>>>>> >> >>
>>>>>> >> >> As the base frequency of normal Titans is 837MHz and the Boost
>>>>>> one
>>>>>> is
>>>>>> >> >> 876MHz I
>>>>>> >> >> assume that your GPU is running automatically also under its
>>>>>> >> >> boost frequency (876MHz).
>>>>>> >> >> You can find this information e.g. in Amber mdout file.
>>>>>> >> >>
>>>>>> >> >> You also mentioned some crashes in your previous email. Your
>>>>>> ERRs
>>>>>> >> were
>>>>>> >> >> something like those here:
>>>>>> >> >>
>>>>>> >> >> #1 ERR written in mdout:
>>>>>> >> >> ------
>>>>>> >> >> | ERROR: max pairlist cutoff must be less than unit cell max
>>>>>> sphere
>>>>>> >> >> radius!
>>>>>> >> >> ------
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> #2 no ERR written in mdout, ERR written in standard output
>>>>>> >> (nohup.out)
>>>>>> >> >>
>>>>>> >> >> ----
>>>>>> >> >> Error: unspecified launch failure launching kernel kNLSkinTest
>>>>>> >> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>>>>>> >> >> ----
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> #3 no ERR written in mdout, ERR written in standard output
>>>>>> >> (nohup.out)
>>>>>> >> >> ----
>>>>>> >> >> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
>>>>>> >> >> ----
>>>>>> >> >>
>>>>>> >> >> or did you obtain some new/additional errors ?
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> M.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> On Fri, 31 May 2013 17:30:57 +0200 filip fratev
>>>>>> >> >> <filipfratev.yahoo.com> wrote:
>>>>>> >> >>
>>>>>> >> >> > Hi,
>>>>>> >> >> > This is what I obtained for 50K tests and "normal" GTXTitan:
>>>>>> >> >> >
>>>>>> >> >> > run1:
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > A V E R A G E S O V E R 50 S T E P S
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 299.87
>>>>>> >> PRESS
>>>>>> >> >> > = 0.0
>>>>>> >> >> > Etot = -443237.1079 EKtot = 257679.9750 EPtot
>>>>>> =
>>>>>> >> >> > -700917.0829
>>>>>> >> >> > BOND = 20193.1856 ANGLE = 53517.5432 DIHED
>>>>>> =
>>>>>> >> >> > 23575.4648
>>>>>> >> >> > 1-4 NB = 21759.5524 1-4 EEL = 742552.5939 VDWAALS
>>>>>> =
>>>>>> >> >> > 96286.7714
>>>>>> >> >> > EELEC = -1658802.1941 EHBOND = 0.0000 RESTRAINT
>>>>>> =
>>>>>> >> >> > 0.0000
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > R M S F L U C T U A T I O N S
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 0.33
>>>>>> >> PRESS
>>>>>> >> >> > = 0.0
>>>>>> >> >> > Etot = 11.2784 EKtot = 284.8999 EPtot
>>>>>> =
>>>>>> >> >> > 289.0773
>>>>>> >> >> > BOND = 136.3417 ANGLE = 214.0054 DIHED
>>>>>> =
>>>>>> >> >> > 59.4893
>>>>>> >> >> > 1-4 NB = 58.5891 1-4 EEL = 330.5400 VDWAALS
>>>>>> =
>>>>>> >> >> > 559.2079
>>>>>> >> >> > EELEC = 743.8771 EHBOND = 0.0000 RESTRAINT
>>>>>> =
>>>>>> >> >> > 0.0000
>>>>>> >> >> > |E(PBS) = 21.8119
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> > run2:
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > A V E R A G E S O V E R 50 S T E P S
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 299.89
>>>>>> >> PRESS
>>>>>> >> >> > = 0.0
>>>>>> >> >> > Etot = -443240.0999 EKtot = 257700.0950 EPtot
>>>>>> =
>>>>>> >> >> > -700940.1949
>>>>>> >> >> > BOND = 20241.9174 ANGLE = 53644.6694 DIHED
>>>>>> =
>>>>>> >> >> > 23541.3737
>>>>>> >> >> > 1-4 NB = 21803.1898 1-4 EEL = 742754.2254 VDWAALS
>>>>>> =
>>>>>> >> >> > 96298.8308
>>>>>> >> >> > EELEC = -1659224.4013 EHBOND = 0.0000 RESTRAINT
>>>>>> =
>>>>>> >> >> > 0.0000
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > R M S F L U C T U A T I O N S
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > NSTEP = 50000 TIME(PS) = 120.020 TEMP(K) = 0.41
>>>>>> >> PRESS
>>>>>> >> >> > = 0.0
>>>>>> >> >> > Etot = 10.7633 EKtot = 348.2819 EPtot
>>>>>> =
>>>>>> >> >> > 353.9918
>>>>>> >> >> > BOND = 106.5314 ANGLE = 196.7052 DIHED
>>>>>> =
>>>>>> >> >> > 69.7476
>>>>>> >> >> > 1-4 NB = 60.3435 1-4 EEL = 400.7466 VDWAALS
>>>>>> =
>>>>>> >> >> > 462.7763
>>>>>> >> >> > EELEC = 651.9857 EHBOND = 0.0000 RESTRAINT
>>>>>> =
>>>>>> >> >> > 0.0000
>>>>>> >> >> > |E(PBS) = 17.0642
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> ------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >
>>>>>> ------------------------------**------------------------------**
>>>>>> --------------------
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > ________________________________
>>>>>> >> >> > From: Marek Maly <marek.maly.ujep.cz>
>>>>>> >> >> > To: AMBER Mailing List <amber.ambermd.org>
>>>>>> >> >> > Sent: Friday, May 31, 2013 3:34 PM
>>>>>> >> >> > Subject: Re: [AMBER] experiences with EVGA GTX TITAN
>>>>>> Superclocked
>>>>>> -
>>>>>> >> >> > memtestG80 - UNDERclocking in Linux ?
>>>>>> >> >> >
>>>>>> >> >> > Hi here are my 100K results for driver 313.30 (and still Cuda
>>>>>> 5.0).
>>>>>> >> >> >
>>>>>> >> >> > The results are rather similar to those obtained
>>>>>> >> >> > under my original driver 319.17 (see the first table
>>>>>> >> >> > which I sent in this thread).
>>>>>> >> >> >
>>>>>> >> >> > M.
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > On Fri, 31 May 2013 12:29:59 +0200 Marek Maly <marek.maly.ujep.cz>
>>>>>> >> >> > wrote:
>>>>>> >> >> >
>>>>>> >> >> >> Hi,
>>>>>> >> >> >>
>>>>>> >> >> >> please try to run at least 100K tests twice to verify exact
>>>>>> >> >> >> reproducibility
>>>>>> >> >> >> of the results on the given card. If you find in any mdin
>>>>>> file
>>>>>> >> ig=-1
>>>>>> >> >> >> just
>>>>>> >> >> >> delete it to ensure that you are using the identical random
>>>>>> seed
>>>>>> >> for
>>>>>> >> >> >> both
>>>>>> >> >> >> runs. You can eventually omit NUCLEOSOME test
>>>>>> >> >> >> as it is too time consuming.
>>>>>> >> >> >>
>>>>>> >> >> >> Driver 310.44 ?????
>>>>>> >> >> >>
>>>>>> >> >> >> As far as I know the proper support for titans is from
>>>>>> version
>>>>>> > 313.26
>>>>>> >> >> >>
>>>>>> >> >> >> see e.g. here :
>>>>>> >> >> >>
>>>>>> >> >>
>>>>>> >
http://www.geeks3d.com/20130306/nvidia-releases-r313-26-for-linux-with-gtx-titan-support/
>>>>>> >> >> >>
>>>>>> >> >> >> BTW: on my side, downgrading to drv. 313.30 did not solve the
>>>>>> >> >> >> situation; I will post my results here soon.
>>>>>> >> >> >>
>>>>>> >> >> >> M.
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> On Fri, 31 May 2013 12:21:21 +0200 ET <sketchfoot.gmail.com>
>>>>>> >> >> >> wrote:
>>>>>> >> >> >>
>>>>>> >> >> >>> ps. I have another install of amber on another computer with
>>>>>> a
>>>>>> >> >> >>> different
>>>>>> >> >> >>> Titan and different Driver Version: 310.44.
>>>>>> >> >> >>>
>>>>>> >> >> >>> In the interests of thrashing the proverbial horse, I'll run
>>>>>> the
>>>>>> >> >> >>> benchmark
>>>>>> >> >> >>> for 50k steps. :P
>>>>>> >> >> >>>
>>>>>> >> >> >>> br,
>>>>>> >> >> >>> g
>>>>>> >> >> >>>
>>>>>> >> >> >>>
>>>>>> >> >> >>> On 31 May 2013 11:17, ET <sketchfoot.gmail.com> wrote:
>>>>>> >> >> >>>
>>>>>> >> >> >>>> Hi, I just ran the Amber benchmark for the default (10000
>>>>>> steps)
>>>>>> >> >> on my
>>>>>> >> >> >>>> Titan.
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> Using sdiff -sB showed that the two runs were completely
>>>>>> > identical.
>>>>>> >> >> >>>> I've
>>>>>> >> >> >>>> attached compressed files of the mdout & diff files.
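For anyone wanting to repeat that comparison, the check reduces to something like the following (the run directory names are hypothetical stand-ins for the two repeat outputs):

```shell
# sdiff -s suppresses common lines and -B ignores blank lines, so two
# bit-identical mdout files produce no output and exit status 0.
if sdiff -sB run1/mdout run2/mdout > mdout.sdiff; then
    echo "runs identical"
else
    echo "runs diverge; differing lines are in mdout.sdiff"
fi
```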
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> br,
>>>>>> >> >> >>>> g
>>>>>> >> >> >>>>
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> On 30 May 2013 23:41, Marek Maly <marek.maly.ujep.cz>
>>>>>> wrote:
>>>>>> >> >> >>>>
>>>>>> >> >> >>>>> OK, let's see. I see eventual downclocking as the very last
>>>>>> >> >> >>>>> possibility (if I don't decide on RMAing). But for now some
>>>>>> >> >> >>>>> other experiments are still available :))
>>>>>> >> >> >>>>> I just started 100K tests under 313.30 driver. For today
>>>>>> good
>>>>>> >> >> night
>>>>>> >> >> >>>>> ...
>>>>>> >> >> >>>>>
>>>>>> >> >> >>>>> M.
>>>>>> >> >> >>>>>
>>>>>> >> >> >>>>> On Fri, 31 May 2013 00:45:49 +0200 Scott Le Grand
>>>>>> >> >> >>>>> <varelse2005.gmail.com> wrote:
>>>>>> >> >> >>>>>
>>>>>> >> >> >>>>> > It will be very interesting if this behavior persists
>>>>>> after
>>>>>> >> >> >>>>> downclocking.
>>>>>> >> >> >>>>> >
>>>>>> >> >> >>>>> > But right now, Titan 0 *looks* hosed and Titan 1 *looks*
>>>>>> like
>>>>>> > it
>>>>>> >> >> >>>>> needs
>>>>>> >> >> >>>>> > downclocking...
>>>>>> >> >> >>>>> > On May 30, 2013 3:20 PM, "Marek Maly"
>>>>>> <marek.maly.ujep.cz>
>>>>>> >> >> wrote:
>>>>>> >> >> >>>>> >
>>>>>> >> >> >>>>> >> Hi all,
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> here are my results from the 500K steps 2 x repeated
>>>>>> > benchmarks
>>>>>> >> >> >>>>> >> under 319.23 driver and still Cuda 5.0 (see the
>>>>>> attached
>>>>>> >> table
>>>>>> >> >> ).
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> It is hard to say if the results are better or worse
>>>>>> than
>>>>>> in
>>>>>> > my
>>>>>> >> >> >>>>> >> previous 100K test under driver 319.17.
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> While results from Cellulose test were improved and the
>>>>>> > TITAN_1
>>>>>> >> >> >>>>> card
>>>>>> >> >> >>>>> >> even
>>>>>> >> >> >>>>> >> successfully finished all 500K steps moreover with
>>>>>> exactly
>>>>>> >> the
>>>>>> >> >> >>>>> same
>>>>>> >> >> >>>>> >> final
>>>>>> >> >> >>>>> >> energy !
>>>>>> >> >> >>>>> >> (TITAN_0 at least finished more than 100K steps and in
>>>>>> >> RUN_01
>>>>>> >> >> even
>>>>>> >> >> >>>>> more
>>>>>> >> >> >>>>> >> than 400K steps)
>>>>>> >> >> >>>>> >> In the JAC_NPT test no GPU was able to finish at least
>>>>>> >> >> >>>>> >> 100K steps, and the results from the JAC_NVE
>>>>>> >> >> >>>>> >> test are also not too convincing. FACTOR_IX_NVE
>>>>>> and
>>>>>> >> >> >>>>> FACTOR_IX_NPT
>>>>>> >> >> >>>>> >> were successfully
>>>>>> >> >> >>>>> >> finished with 100% reproducibility in FACTOR_IX_NPT
>>>>>> case
>>>>>> >> (on
>>>>>> >> >> both
>>>>>> >> >> >>>>> >> cards)
>>>>>> >> >> >>>>> >> and almost
>>>>>> >> >> >>>>> >> 100% reproducibility in case of FACTOR_IX_NVE (again
>>>>>> 100%
>>>>>> in
>>>>>> >> >> case
>>>>>> >> >> >>>>> of
>>>>>> >> >> >>>>> >> TITAN_1). TRPCAGE, MYOGLOBIN
>>>>>> >> >> >>>>> >> again finished without any problem with 100%
>>>>>> >> reproducibility.
>>>>>> >> >> >>>>> NUCLEOSOME
>>>>>> >> >> >>>>> >> test was not done
>>>>>> >> >> >>>>> >> this time due to high time requirements. If you find in
>>>>>> >> >> >>>>> >> the table a positive number ending with K (which means
>>>>>> >> >> >>>>> >> "thousands"), it is the last step number written in
>>>>>> >> >> >>>>> >> mdout before the crash.
>>>>>> >> >> >>>>> >> Below are all the 3 types of detected errs with
>>>>>> relevant
>>>>>> >> >> >>>>> systems/rounds
>>>>>> >> >> >>>>> >> where the given err
>>>>>> >> >> >>>>> >> appeared.
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> Now I will try just 100K tests under ETs favourite
>>>>>> driver
>>>>>> >> >> version
>>>>>> >> >> >>>>> 313.30
>>>>>> >> >> >>>>> >> :)) and then
>>>>>> >> >> >>>>> >> I will eventually try to experiment with cuda 5.5 which
>>>>>> I
>>>>>> >> >> already
>>>>>> >> >> >>>>> >> downloaded from the
>>>>>> >> >> >>>>> >> cuda zone ( I had to become cuda developer for this :))
>>>>>> )
>>>>>> >> BTW
>>>>>> >> >> ET
>>>>>> >> >> >>>>> thanks
>>>>>> >> >> >>>>> >> for the frequency info !
>>>>>> >> >> >>>>> >> and I am still ( perhaps not alone :)) ) very curious
>>>>>> about
>>>>>> >> >> your 2
>>>>>> >> >> >>>>> x
>>>>>> >> >> >>>>> >> repeated Amber benchmark tests with superclocked Titan.
>>>>>> >> Indeed
>>>>>> >> >> >>>>> that
>>>>>> >> >> >>>>> I
>>>>>> >> >> >>>>> am
>>>>>> >> >> >>>>> >> very curious also about that Ross "hot" patch.
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> M.
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> ERRORS DETECTED DURING THE 500K steps tests with driver
>>>>>> >> 319.23
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> #1 ERR written in mdout:
>>>>>> >> >> >>>>> >> ------
>>>>>> >> >> >>>>> >> | ERROR: max pairlist cutoff must be less than unit
>>>>>> cell
>>>>>> >> max
>>>>>> >> >> >>>>> sphere
>>>>>> >> >> >>>>> >> radius!
>>>>>> >> >> >>>>> >> ------
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> TITAN_0 ROUND_1 JAC_NPT (at least 5000 steps
>>>>>> successfully
>>>>>> > done
>>>>>> >> >> >>>>> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >> TITAN_0 ROUND_2 JAC_NPT (at least 8000 steps
>>>>>> successfully
>>>>>> > done
>>>>>> >> >> >>>>> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> #2 no ERR written in mdout, ERR written in standard
>>>>>> output
>>>>>> >> >> >>>>> (nohup.out)
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> ----
>>>>>> >> >> >>>>> >> Error: unspecified launch failure launching kernel
>>>>>> >> kNLSkinTest
>>>>>> >> >> >>>>> >> cudaFree GpuBuffer::Deallocate failed unspecified
>>>>>> launch
>>>>>> >> >> failure
>>>>>> >> >> >>>>> >> ----
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> TITAN_0 ROUND_1 CELLULOSE_NVE (at least 437 000 steps
>>>>>> >> >> successfully
>>>>>> >> >> >>>>> done
>>>>>> >> >> >>>>> >> before crash)
>>>>>> >> >> >>>>> >> TITAN_0 ROUND_2 JAC_NVE (at least 162 000 steps
>>>>>> >> successfully
>>>>>> >> >> done
>>>>>> >> >> >>>>> >> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >> TITAN_0 ROUND_2 CELLULOSE_NVE (at least 117 000 steps
>>>>>> >> >> successfully
>>>>>> >> >> >>>>> done
>>>>>> >> >> >>>>> >> before crash)
>>>>>> >> >> >>>>> >> TITAN_1 ROUND_1 JAC_NVE (at least 119 000 steps
>>>>>> >> successfully
>>>>>> >> >> done
>>>>>> >> >> >>>>> >> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >> TITAN_1 ROUND_2 JAC_NVE (at least 43 000 steps
>>>>>> successfully
>>>>>> >> >> done
>>>>>> >> >> >>>>> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> #3 no ERR written in mdout, ERR written in standard
>>>>>> output
>>>>>> >> >> >>>>> (nohup.out)
>>>>>> >> >> >>>>> >> ----
>>>>>> >> >> >>>>> >> cudaMemcpy GpuBuffer::Download failed unspecified
>>>>>> launch
>>>>>> >> >> failure
>>>>>> >> >> >>>>> >> ----
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> TITAN_1 ROUND_1 JAC_NPT (at least 77 000 steps
>>>>>> successfully
>>>>>> >> >> done
>>>>>> >> >> >>>>> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >> TITAN_1 ROUND_2 JAC_NPT (at least 58 000 steps
>>>>>> successfully
>>>>>> >> >> done
>>>>>> >> >> >>>>> before
>>>>>> >> >> >>>>> >> crash)
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >>
>>>>>> >> >> >>>>> >> On Thu, 30 May 2013 21:27:17 +0200 Scott Le Grand
>>>>>> >> >> >>>>> >> <varelse2005.gmail.com> wrote:
>>>>>> >> >> >>>>> >>
>>>> Oops, meant to send that to Jason...
>>>>
>>>> Anyway, before we all panic, we need to get K20's behavior analyzed
>>>> here. If it's deterministic, this truly is a hardware issue. If not,
>>>> then it gets interesting, because 680 is deterministic as far as I can
>>>> tell...
>>>>
>>>> On May 30, 2013 12:24 PM, "Scott Le Grand" <varelse2005.gmail.com> wrote:
>>>>>> >> >> >>>>> >>>
>>>>> If the errors are not deterministically triggered, they probably won't
>>>>> be fixed by the patch, alas...
>>>>>
>>>>> On May 30, 2013 12:15 PM, "Jason Swails" <jason.swails.gmail.com> wrote:
>>>>>
>>>>>> Just a reminder to everyone based on what Ross said: there is a
>>>>>> pending patch to pmemd.cuda that will be coming out shortly (maybe
>>>>>> even within hours). It's entirely possible that several of these
>>>>>> errors are fixed by this patch.
>>>>>>
>>>>>> All the best,
>>>>>> Jason
>>>>>>
>>>>>> On Thu, May 30, 2013 at 2:46 PM, filip fratev <filipfratev.yahoo.com> wrote:
>>>>>>
>>>>>>> I have observed the same crashes from time to time. I will run
>>>>>>> cellulose NVE for 100k steps and will post the results here.
>>>>>>>
>>>>>>> All the best,
>>>>>>> Filip
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Scott Le Grand <varelse2005.gmail.com>
>>>>>>> To: AMBER Mailing List <amber.ambermd.org>
>>>>>>> Sent: Thursday, May 30, 2013 9:01 PM
>>>>>>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>>>>> memtestG80 - UNDERclocking in Linux ?
>>>>>>>
>>>>>>> Run cellulose NVE for 100k iterations twice. If the final energies
>>>>>>> don't match, you have a hardware issue. No need to play with ntpr or
>>>>>>> any other variable.
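The two-run check described above can be sketched in a few lines. This is an illustration with stand-in mdout text, not AMBER tooling; the `final_etot` helper and the sample strings are assumptions, and in practice one would read the real mdout files and compare the Etot printed at the final NSTEP (or simply diff the two files).

```python
# Sketch of the two-run determinism check: run the same cellulose NVE
# input twice on the same GPU, then compare the final total energies.
import re

def final_etot(mdout_text):
    """Return the last 'Etot' value (as printed) in mdout-style text."""
    values = re.findall(r"Etot\s*=\s*(-?\d+\.\d+)", mdout_text)
    if not values:
        raise ValueError("no Etot records found")
    return values[-1]

# Stand-in output from two runs; real files would be read from disk.
run1 = "NSTEP = 100000 ...\n Etot = -443256.6711 ...\n"
run2 = "NSTEP = 100000 ...\n Etot = -443256.6711 ...\n"

# Exact textual equality is the criterion: pmemd.cuda is deterministic
# on healthy hardware, so any mismatch points at the GPU.
if final_etot(run1) == final_etot(run2):
    print("runs match: GPU looks deterministic")
else:
    print("runs differ: suspect hardware")
```

Comparing the printed strings rather than parsed floats is deliberate: the test is for bitwise reproducibility, so even a last-digit difference counts as a failure.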
>>>>>>>
>>>>>>> On May 30, 2013 10:58 AM, <pavel.banas.upol.cz> wrote:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I would also like to share one of my experiences with Titan cards.
>>>>>>>> We have one GTX Titan card, and with one system (~55k atoms, NVT,
>>>>>>>> RNA + waters) we ran into the same troubles you are describing. I
>>>>>>>> was also playing with ntpr to figure out what is going on, step by
>>>>>>>> step. I understand that the code uses different routines for
>>>>>>>> calculating energies+forces and for forces only. The simulations of
>>>>>>>> other systems are perfectly stable, running for days and weeks. Only
>>>>>>>> that particular system systematically ends up with this error.
>>>>>>>>
>>>>>>>> However, there was one interesting issue. When I set ntpr=1, the
>>>>>>>> error vanished (systematically, in multiple runs) and the simulation
>>>>>>>> was able to run for more than a million steps (I did not let it run
>>>>>>>> for weeks, as in the meantime I shifted that simulation to another
>>>>>>>> card - I need data, not testing). All other settings of ntpr failed.
>>>>>>>> As I read this discussion, I tried to set ene_avg_sampling=1 with
>>>>>>>> some high value of ntpr (I expected that this would shift the code
>>>>>>>> to permanently use the force+energies part of the code, similarly to
>>>>>>>> ntpr=1), but the error occurred again.
>>>>>>>>
>>>>>>>> I know it is not very conclusive for finding out what is happening,
>>>>>>>> at least not for me. Do you have any idea why ntpr=1 might help?
>>>>>>>>
>>>>>>>> best regards,
>>>>>>>>
>>>>>>>> Pavel
>>>>>>>>
>>>>>>>> --
>>>>>>>> Pavel Banáš
>>>>>>>> pavel.banas.upol.cz
>>>>>>>> Department of Physical Chemistry,
>>>>>>>> Palacky University Olomouc
>>>>>>>> Czech Republic
>>>>>>>>
>>>>>>>> ---------- Original message ----------
>>>>>>>> From: Jason Swails <jason.swails.gmail.com>
>>>>>>>> Date: 29. 5. 2013
>>>>>>>> Subject: Re: [AMBER] experiences with EVGA GTX TITAN Superclocked -
>>>>>>>> memtestG80 - UNDERclocking in Linux ?
>>>>>>>>
>>>>>>>> "I'll answer a little bit:
>>>>>>>>
>>>>>>>> > NTPR=10 Etot after 2000 steps
>>>>>>>> >
>>>>>>>> > -443256.6711
>>>>>>>> > -443256.6711
>>>>>>>> >
>>>>>>>> > NTPR=200 Etot after 2000 steps
>>>>>>>> >
>>>>>>>> > -443261.0705
>>>>>>>> > -443261.0705
>>>>>>>> >
>>>>>>>> > Any idea why energies should depend on the frequency of energy
>>>>>>>> > records (NTPR)?
>>>>>>>>
>>>>>>>> It is a subtle point, but the answer is 'different code paths.' In
>>>>>>>> general, it is NEVER necessary to compute the actual energy of a
>>>>>>>> molecule during the course of standard molecular dynamics (by
>>>>>>>> analogy, it is NEVER necessary to compute atomic forces during the
>>>>>>>> course of random Monte Carlo sampling).
>>>>>>>>
>>>>>>>> For performance's sake, then, pmemd.cuda computes only the force
>>>>>>>> when energies are not requested, leading to a different order of
>>>>>>>> operations for those runs. This difference ultimately causes
>>>>>>>> divergence.
>>>>>>>>
>>>>>>>> To test this, try setting the variable ene_avg_sampling=10 in the
>>>>>>>> &cntrl section. This will force pmemd.cuda to compute energies every
>>>>>>>> 10 steps (for energy averaging), which will in turn make the code
>>>>>>>> path followed identical for any multiple-of-10 value of ntpr.
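The suggested test amounts to one extra line in the &cntrl namelist. A minimal sketch of such an mdin fragment follows; everything except ntpr and ene_avg_sampling is a placeholder to be taken from the actual input:

```
 &cntrl
   imin = 0, nstlim = 2000, dt = 0.002,   ! placeholders: use your run's values
   ntpr = 200,                            ! print energies every 200 steps
   ene_avg_sampling = 10,                 ! but compute them every 10 steps
 /
```

With this setting, any ntpr that is a multiple of 10 should reproduce the ntpr=10 trajectory, since energies are evaluated on the same steps either way.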
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason M. Swails
>>>>>>>> Quantum Theory Project,
>>>>>>>> University of Florida
>>>>>>>> Ph.D. Candidate
>>>>>>>> 352-392-4032
>>>>>>>> _______________________________________________
>>>>>>>> AMBER mailing list
>>>>>>>> AMBER.ambermd.org
>>>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>>>>> "
>>> --
>>> This message was created with Opera's revolutionary mail client:
>>> http://www.opera.com/mail/
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Jun 02 2013 - 10:00:02 PDT