Re: [AMBER] GTX780 News

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 11 Jul 2013 20:24:56 -0700

Hi ET,

Ok, I am retrying my GTX780 runs with nstlim=1000000 and JAC NPT. 10 x 4
GPU for 40 runs. I'll let you know how it goes. One thing to note though,
if it is temperature related my box is a rack mount system with
professional front to back cooling, like in the attached flyer, and NOT a
homebrew desktop so it is very possible that you are seeing problems
because your box cooling is not as good. Pure speculation on my part
though as I don't have any data to back this up.

As for the COM velocity printing - that is normal - here's the code in
runmd.F90

    if (nscm .ne. 0) then
       if (mod(total_nstep + 1, nscm) .eq. 0) then
          if (ifbox .ne. 0) then
             if (.not. doing_langevin) then
                ..
                ..
                if (master) then
                   velocity2 = vcm(1) * vcm(1) + vcm(2) * vcm(2) + &
                               vcm(3) * vcm(3)
                   write(mdout,'(a,f15.6,f9.2,a)') &
                      'check COM velocity, temp: ', &
                      sqrt(velocity2), 0.5d0 * tmass * &
                      velocity2 / fac(1), '(Removed)'
                end if

so it gets printed whenever nscm /= 0 (it is 1000 by default), but only
when your step count + 1 is a multiple of nscm, when periodic boundaries
are on, and when you are NOT running Langevin dynamics (ntt=3). All of
this is true for the JAC NPT case, which is why it is printed.


You typically only notice this if you set ntpr >> nscm. Normally, with say
ntpr = 100, you would see the check COM message printed only after every
10th energy entry in mdout, and the human eye misses it. Change ntpr to a
much bigger number and you get lots of check COMs printed (one every 1000
steps) between each energy printout, so the human eye is drawn to them and
notices.
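
As a rough illustration of the counting argument above, here is a minimal
Python sketch (not AMBER code). It assumes, as a simplification, that an
energy record is written on every step that is a multiple of ntpr, and it
mirrors the conditions in the excerpt: nscm non-zero, periodic boundaries
on, no Langevin thermostat. The function names and the periodic/langevin
arguments are made up for this example.

```python
def com_check_steps(nstlim, nscm, periodic=True, langevin=False):
    """1-based step numbers at which a 'check COM velocity' line is written.

    Mirrors the guard in the excerpt: nscm /= 0, mod(step, nscm) == 0,
    periodic box on, and Langevin dynamics off.
    """
    if nscm == 0 or not periodic or langevin:
        return []
    return [step for step in range(1, nstlim + 1) if step % nscm == 0]


def com_lines_per_energy_block(nstlim, nscm, ntpr):
    """Number of 'check COM' lines that land before each energy printout.

    Assumption: an energy record is written whenever step % ntpr == 0.
    """
    com_steps = com_check_steps(nstlim, nscm)
    counts = []
    previous_print = 0
    for step in range(ntpr, nstlim + 1, ntpr):
        # COM messages written after the previous energy record, up to this one
        counts.append(sum(1 for s in com_steps if previous_print < s <= step))
        previous_print = step
    return counts


if __name__ == "__main__":
    # ET's settings (first 50000 steps): ntpr=25000 with the default nscm=1000
    print(com_lines_per_energy_block(50000, 1000, 25000))  # [25, 25]
    # A more typical ntpr=100: one COM line per ten energy records
    print(com_lines_per_energy_block(2000, 1000, 100))
```

With ntpr=25000 this gives 25 check COM lines per energy record, which
matches the 25 lines between the NSTEP = 25000 and NSTEP = 50000 records
in the quoted mdout.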


All the best
Ross


On 7/11/13 11:46 AM, "ET" <sketchfoot.gmail.com> wrote:

>so do you think if I want trouble free GPU runs I'm best off returning all
>the 780's, even the one that has not failed any of the
>benchmarks/reproducibility errors?
>
>Also for my reference are multiple "check COM velocity, temp" type of
>entries in the mdout (on completing a run successfully) OK or not?
>
>check COM velocity, temp: 0.000064 0.00(Removed)
>check COM velocity, temp: 0.000035 0.00(Removed)
>check COM velocity, temp: 0.000020 0.00(Removed)
>check COM velocity, temp: 0.000031 0.00(Removed)
>check COM velocity, temp: 0.000045 0.00(Removed)
>check COM velocity, temp: 0.000023 0.00(Removed)
>check COM velocity, temp: 0.000010 0.00(Removed)
>check COM velocity, temp: 0.000022 0.00(Removed)
>check COM velocity, temp: 0.000044 0.00(Removed)
>check COM velocity, temp: 0.000047 0.00(Removed)
>check COM velocity, temp: 0.000014 0.00(Removed)
>check COM velocity, temp: 0.000032 0.00(Removed)
>check COM velocity, temp: 0.000037 0.00(Removed)
>
>many thanks
>
>
>On 11 July 2013 17:47, Scott Le Grand <varelse2005.gmail.com> wrote:
>
>> So alas this doesn't surprise me at all. The skinnb errors happen when
>> the simulation goes to NaNaland in this case. What this comes down to
>> right now is that I don't trust *any* consumer GK110s.
>>
>>
>>
>>
>> On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:
>>
>> > 1) Initially, I ran the full set of AMBER benchmarks at the standard
>> > settings (100k steps)
>> >
>> > All the cards passed without issue
>> >
>> > 2) Increased nstlim to 200k steps
>> >
>> > one card outright crashed with the error: Nonbond cells need to be
>> > recalculated, restart simulation from previous checkpoint
>> > with a higher value for skinnb.
>> >
>> > reproducibility errors occurring in two other cards - in JAC NPT &
>> > Cellulose NPT.
>> >
>> > 3) At this point decided to concentrate on JAC NPT as it is the largest
>> > source of errors and nstlim can be extended without that much of a time
>> > penalty.
>> > So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
>> > with staggered start times to offset disk I/O.
>> >
>> > mdin:
>> > ntx=5, irest=1,
>> > ntc=2, ntf=2,
>> > nstlim=2500000,
>> > ntpr=25000, ntwx=25000,
>> > ntwr=250000,
>> > dt=0.002, cut=8.,
>> > ntt=1, tautp=10.0,
>> > temp0=300.0,
>> > ntb=2, ntp=1, taup=10.0,
>> > ioutfm=1,ig=43689,
>> >
>> >
>> > The card in PCI slot 0 never failed. The other 3 cards (named after
>> > which PCIe slot they occupied) always failed in the following order:
>> >
>> > card1 = within the first 10-20 mins
>> > card2 = shortly after card1
>> > card3 = takes a long time to fail. Almost gets to the end and sometimes
>> > makes it
>> >
>> > The failure error was always a skinnb type error
>> >
>> > Obviously it was quite suspicious that only card0 in the primary PCIe
>> > slot passed and I thought it may have something to do with the switching
>> > function on the plex chip interfering with things. So I took all the
>> > cards out and tested them individually in PCIe 0. All of them failed
>> > with the skinnb error. Additionally, every step in the mdout file is
>> > populated with:
>> >
>> > ########################################
>> > check COM velocity, temp: 0.000028 0.00(Removed)
>> > check COM velocity, temp: 0.000037 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> >
>> > NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
>> > Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
>> > BOND = 473.9411 ANGLE = 1296.5580 DIHED = 977.4736
>> > 1-4 NB = 551.6041 1-4 EEL = 6656.6898 VDWAALS = 8413.2767
>> > EELEC = -90948.8832 EHBOND = 0.0000 RESTRAINT = 0.0000
>> > EKCMT = 6304.1052 VIRIAL = 7593.8283 VOLUME = 234670.0662
>> > Density = 1.0226
>> >
>> > ------------------------------------------------------------------------------
>> >
>> > check COM velocity, temp: 0.000064 0.00(Removed)
>> > check COM velocity, temp: 0.000035 0.00(Removed)
>> > check COM velocity, temp: 0.000020 0.00(Removed)
>> > check COM velocity, temp: 0.000031 0.00(Removed)
>> > check COM velocity, temp: 0.000045 0.00(Removed)
>> > check COM velocity, temp: 0.000023 0.00(Removed)
>> > check COM velocity, temp: 0.000010 0.00(Removed)
>> > check COM velocity, temp: 0.000022 0.00(Removed)
>> > check COM velocity, temp: 0.000044 0.00(Removed)
>> > check COM velocity, temp: 0.000047 0.00(Removed)
>> > check COM velocity, temp: 0.000014 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000037 0.00(Removed)
>> > check COM velocity, temp: 0.000017 0.00(Removed)
>> > check COM velocity, temp: 0.000040 0.00(Removed)
>> > check COM velocity, temp: 0.000028 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000014 0.00(Removed)
>> > check COM velocity, temp: 0.000030 0.00(Removed)
>> > check COM velocity, temp: 0.000042 0.00(Removed)
>> > check COM velocity, temp: 0.000036 0.00(Removed)
>> > check COM velocity, temp: 0.000027 0.00(Removed)
>> > check COM velocity, temp: 0.000040 0.00(Removed)
>> > check COM velocity, temp: 0.000026 0.00(Removed)
>> > check COM velocity, temp: 0.000053 0.00(Removed)
>> >
>> > NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
>> > Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
>> > BOND = 468.9608 ANGLE = 1272.6458 DIHED = 1000.8139
>> > 1-4 NB = 554.1092 1-4 EEL = 6681.9525 VDWAALS = 8584.4464
>> > EELEC = -91166.3710 EHBOND = 0.0000 RESTRAINT = 0.0000
>> > EKCMT = 6287.7987 VIRIAL = 5978.9791 VOLUME = 234711.1150
>> > Density = 1.0224
>> >
>> > ------------------------------------------------------------------------------
>> >
>> > check COM velocity, temp: 0.000048 0.00(Removed)
>> > check COM velocity, temp: 0.000044 0.00(Removed)
>> > check COM velocity, temp: 0.000034 0.00(Removed)
>> > check COM velocity, temp: 0.000018 0.00(Removed)
>> > ########################################
>> >
>> > I have not seen this before, am not sure whether this is normal or not.
>> > If someone could clarify it would be appreciated.
>> >
>> > 4) I put card0 back into the box on its own and I ran 2x100ns of
>> > production simulation of HIV-protease NPT with no issues. So am pretty
>> > convinced that this card is good.
>> >
>> >
>> > With my 4x780 setup 3 of the cards failed with errors and had NPT
>> > deterministic issues when they did not crash, which seems very bad luck
>> > considering Ross tested a 4-GPU combo with no failures at all. I thought
>> > that this may have something to do with the particular batches of the
>> > card that have been produced at various times. So I checked all the
>> > serial numbers printed on the hardware. The serial numbers, etc were all
>> > the same, but what was quite weird was that only the card that was
>> > working had a distinctive stamp:
>> >
>> > "T7-E5"
>> >
>> > Probably it's nothing, but it would be interesting to know whether any
>> > other owners of working Zotac 780's have this stamp or not.
>> >
>> > Going to RMA 3x Zotacs now and go to 680s.
>> >
>> >
>> >
>> >
>> > On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
>> >
>> > > !ai caramba! :/
>> > >
>> > > it looks like 3 of the cards are consistently failing with skinnb
>> > > errors on.....
>> > >
>> > >
>> > > you guessed it:
>> > >
>> > > JAC NPT
>> > >
>> > > Have been running tests this weekend. Will post my findings later
>> > > today or tomorrow.
>> > >
>> > >
>> > > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
>> > >
>> > >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
>> > >>
>> > >> All the tests were reproducible across 2x repeats.
>> > >>
>> > >> Going to get a couple of more today.
>> > >>
>> > >> br,
>> > >> g
>> > >>
>> > >>
>> > >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
>> > >>
>> > >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will
>> > >>> check 'em tomorrow. If they are good will order another 2.
>> > >>>
>> > >>> Thanks again for testing them.
>> > >>>
>> > >>>
>> > >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
>> > >>>
>> > >>>> The GTX780s do not appear to be broken - we are just being cautious
>> > >>>> right now.
>> > >>>>
>> > >>>> The Titans are broken for everyone right now - well, broken for
>> > >>>> anyone who actually hits what they are broken with - which is still
>> > >>>> being investigated. But certainly for anyone who uses cuFFT the
>> > >>>> Titans appear to be broken right now.
>> > >>>>
>> > >>>> All the best
>> > >>>> Ross
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
>> > >>>>
>> > >>>> >Are they "broken" only in terms of AMBER? Or could this be classed
>> > >>>> >as a general hardware fault pertaining to all applications that use
>> > >>>> >the card?
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
>> > >>>> >
>> > >>>> >> It's not really a question of how it's programmed, it's a
>> > >>>> >> question of manufacturing. One picks 12 out of 15 processor cores
>> > >>>> >> on the chip itself to make a GTX 780 as opposed to picking 14 out
>> > >>>> >> of 15 processor cores to make a GTX Titan. In the former, there
>> > >>>> >> are 455 ways to do so and in the latter, 15.
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
>> > >>>> >>
>> > >>>> >> > Thanks very much for the quick information guys! It's much
>> > >>>> >> > appreciated.
>> > >>>> >> >
>> > >>>> >> > I'm not that up on the manner in which these cards are
>> > >>>> >> > programmed, so am a little confused by your explanation Scott.
>> > >>>> >> > Could you please clarify it for me?
>> > >>>> >> >
>> > >>>> >> > br,
>> > >>>> >> > g
>> > >>>> >> >
>> > >>>> >> >
>> > >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
>> > >>>> >> >
>> > >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan,
>> > >>>> >> > > one of them is disabled. There are 15 (15 choose 1) ways to do
>> > >>>> >> > > this. All of them seem to be broken.
>> > >>>> >> > >
>> > >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means
>> > >>>> >> > > there are 455 (15 choose 3) ways to make one. I'm a little
>> > >>>> >> > > nervous that some of those configurations may be broken, so
>> > >>>> >> > > the best thing to do is to test if they exhibit deterministic
>> > >>>> >> > > behavior upon acquiring them, and if they don't, RMA them as
>> > >>>> >> > > defective.
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>> > >>>> >> > >
>> > >>>> >> > > > Hi All,
>> > >>>> >> > > >
>> > >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing
>> > >>>> >> > > > neither Scott nor myself have been able to break the
>> > >>>> >> > > > GTX780s. This is in a 4 x GTX780 Exxact system although at
>> > >>>> >> > > > present we have only tested multiple single GPU runs using
>> > >>>> >> > > > all 4 GPUs at once - i.e. pmemd.cuda (NOT pmemd.cuda.MPI) -
>> > >>>> >> > > > I will be testing pmemd.cuda.MPI shortly but I don't see why
>> > >>>> >> > > > this wouldn't work given single GPU is working fine.
>> > >>>> >> > > >
>> > >>>> >> > > > Key though is that there are multiple ways to build GTX780s,
>> > >>>> >> > > > and for now we have only tested one specific model which is
>> > >>>> >> > > > as follows:
>> > >>>> >> > > >
>> > >>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > > Until we have an opportunity to test different vendor
>> > >>>> >> > > > GTX780s and OC versions the advice is to stick with the
>> > >>>> >> > > > above model if you can.
>> > >>>> >> > > >
>> > >>>> >> > > > All the best
>> > >>>> >> > > > Ross
>> > >>>> >> > > >
>> > >>>> >> > > > /\
>> > >>>> >> > > > \/
>> > >>>> >> > > > |\oss Walker
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > > ---------------------------------------------------------
>> > >>>> >> > > > | Associate Research Professor                         |
>> > >>>> >> > > > | San Diego Supercomputer Center                       |
>> > >>>> >> > > > | Adjunct Associate Professor                          |
>> > >>>> >> > > > | Dept. of Chemistry and Biochemistry                  |
>> > >>>> >> > > > | University of California San Diego                   |
>> > >>>> >> > > > | NVIDIA Fellow                                        |
>> > >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
>> > >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> > >>>> >> > > > ---------------------------------------------------------
>> > >>>> >> > > >
>> > >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of
>> > >>>> >> > > > delivery, may not be read every day, and should not be used
>> > >>>> >> > > > for urgent or sensitive issues.
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > > _______________________________________________
>> > >>>> >> > > > AMBER mailing list
>> > >>>> >> > > > AMBER.ambermd.org
>> > >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
>> > >>>> >> > > >



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu Jul 11 2013 - 20:30:06 PDT