Re: [AMBER] GTX780 News

From: ET <sketchfoot.gmail.com>
Date: Thu, 11 Jul 2013 19:46:33 +0100

So do you think that if I want trouble-free GPU runs, I'm best off returning all
the 780s, even the one that has not failed any of the benchmarks or shown any
reproducibility errors?

Also, for my reference, are multiple "check COM velocity, temp" entries in
the mdout (on completing a run successfully) OK or not?

check COM velocity, temp: 0.000064 0.00(Removed)
check COM velocity, temp: 0.000035 0.00(Removed)
check COM velocity, temp: 0.000020 0.00(Removed)
check COM velocity, temp: 0.000031 0.00(Removed)
check COM velocity, temp: 0.000045 0.00(Removed)
check COM velocity, temp: 0.000023 0.00(Removed)
check COM velocity, temp: 0.000010 0.00(Removed)
check COM velocity, temp: 0.000022 0.00(Removed)
check COM velocity, temp: 0.000044 0.00(Removed)
check COM velocity, temp: 0.000047 0.00(Removed)
check COM velocity, temp: 0.000014 0.00(Removed)
check COM velocity, temp: 0.000032 0.00(Removed)
check COM velocity, temp: 0.000037 0.00(Removed)
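
As a rough way to quantify these entries, the sketch below counts how many
"check COM velocity" lines appear before each NSTEP record in an mdout. It is
only an illustration: the function name and the shortened sample text are made
up for this example, and the parsing assumes the line formats shown in this
thread. If the lines come from periodic centre-of-mass motion removal (an
assumption here; that interval is set by nscm, which the mdin quoted below does
not set, so its default of 1000 is assumed), one would expect ntpr/nscm =
25000/1000 = 25 entries per print interval. The block between NSTEP = 25000 and
NSTEP = 50000 in the output quoted below does contain 25 entries.

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")

def com_checks_per_block(mdout_text):
    """Return (nstep, count) pairs: how many 'check COM velocity'
    lines were seen since the previous NSTEP record."""
    counts = []
    pending = 0
    for line in mdout_text.splitlines():
        if "check COM velocity" in line:
            pending += 1
            continue
        match = NSTEP_RE.search(line)
        if match:
            counts.append((int(match.group(1)), pending))
            pending = 0
    return counts

# Shortened sample in the format shown in this thread (not a full mdout).
sample = """\
check COM velocity, temp: 0.000028 0.00(Removed)
check COM velocity, temp: 0.000037 0.00(Removed)
 NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
check COM velocity, temp: 0.000064 0.00(Removed)
 NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
"""

print(com_checks_per_block(sample))  # prints [(25000, 2), (50000, 1)]
```

A constant count per interval would at least suggest the entries are periodic
bookkeeping rather than something triggered by an error, but that reading is a
guess until someone who knows the code confirms it.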

many thanks


On 11 July 2013 17:47, Scott Le Grand <varelse2005.gmail.com> wrote:

> So alas this doesn't surprise me at all. The skinnb errors happen when the
> simulation goes to NaNaland in this case. What this comes down to right
> now is that I don't trust *any* consumer GK110s.
>
>
>
>
> On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:
>
> > 1) Initially, I ran the full set of AMBER benchmarks at the standard
> > settings (100k steps)
> >
> > All the cards passed without issue
> >
> > 2) Increased nstlim to 200k steps
> >
> > one card outright crashed with the error: Nonbond cells need to be
> > recalculated, restart simulation from previous checkpoint
> > with a higher value for skinnb.
> >
> > Reproducibility errors occurred on two other cards, in JAC NPT and
> > Cellulose NPT.
> >
> > 3) At this point I decided to concentrate on JAC NPT, as it is the largest
> > source of errors and nstlim can be extended without much of a time
> > penalty.
> > So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
> > with staggered start times to offset disk I/O.
> >
> > mdin:
> > ntx=5, irest=1,
> > ntc=2, ntf=2,
> > nstlim=2500000,
> > ntpr=25000, ntwx=25000,
> > ntwr=250000,
> > dt=0.002, cut=8.,
> > ntt=1, tautp=10.0,
> > temp0=300.0,
> > ntb=2, ntp=1, taup=10.0,
> > ioutfm=1,ig=43689,
> >
> >
> > The card in PCI slot 0 never failed. The other 3 cards (named after which
> > PCIe slot they occupied) always failed in the following order:
> >
> > card1 = within the first 10-20 mins
> > card2 = shortly after card1
> > card3 = takes a long time to fail. Almost gets to the end and sometimes
> > makes it
> >
> > The failure error was always a skinnb type error
> >
> > Obviously it was quite suspicious that only card0 in the primary PCIe slot
> > passed, and I thought it might have something to do with the switching
> > function on the PLX chip interfering with things. So I took all the cards
> > out and tested them individually in PCIe 0. All of them failed with the
> > skinnb error. Additionally, every step in the mdout file is populated with:
> >
> > ########################################
> > check COM velocity, temp: 0.000028 0.00(Removed)
> > check COM velocity, temp: 0.000037 0.00(Removed)
> > check COM velocity, temp: 0.000032 0.00(Removed)
> > check COM velocity, temp: 0.000032 0.00(Removed)
> >
> > NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
> > Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
> > BOND = 473.9411 ANGLE = 1296.5580 DIHED = 977.4736
> > 1-4 NB = 551.6041 1-4 EEL = 6656.6898 VDWAALS = 8413.2767
> > EELEC = -90948.8832 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 6304.1052 VIRIAL = 7593.8283 VOLUME = 234670.0662
> > Density = 1.0226
> >
> > ------------------------------------------------------------------------------
> >
> > check COM velocity, temp: 0.000064 0.00(Removed)
> > check COM velocity, temp: 0.000035 0.00(Removed)
> > check COM velocity, temp: 0.000020 0.00(Removed)
> > check COM velocity, temp: 0.000031 0.00(Removed)
> > check COM velocity, temp: 0.000045 0.00(Removed)
> > check COM velocity, temp: 0.000023 0.00(Removed)
> > check COM velocity, temp: 0.000010 0.00(Removed)
> > check COM velocity, temp: 0.000022 0.00(Removed)
> > check COM velocity, temp: 0.000044 0.00(Removed)
> > check COM velocity, temp: 0.000047 0.00(Removed)
> > check COM velocity, temp: 0.000014 0.00(Removed)
> > check COM velocity, temp: 0.000032 0.00(Removed)
> > check COM velocity, temp: 0.000037 0.00(Removed)
> > check COM velocity, temp: 0.000017 0.00(Removed)
> > check COM velocity, temp: 0.000040 0.00(Removed)
> > check COM velocity, temp: 0.000028 0.00(Removed)
> > check COM velocity, temp: 0.000032 0.00(Removed)
> > check COM velocity, temp: 0.000014 0.00(Removed)
> > check COM velocity, temp: 0.000030 0.00(Removed)
> > check COM velocity, temp: 0.000042 0.00(Removed)
> > check COM velocity, temp: 0.000036 0.00(Removed)
> > check COM velocity, temp: 0.000027 0.00(Removed)
> > check COM velocity, temp: 0.000040 0.00(Removed)
> > check COM velocity, temp: 0.000026 0.00(Removed)
> > check COM velocity, temp: 0.000053 0.00(Removed)
> >
> > NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
> > Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
> > BOND = 468.9608 ANGLE = 1272.6458 DIHED = 1000.8139
> > 1-4 NB = 554.1092 1-4 EEL = 6681.9525 VDWAALS = 8584.4464
> > EELEC = -91166.3710 EHBOND = 0.0000 RESTRAINT = 0.0000
> > EKCMT = 6287.7987 VIRIAL = 5978.9791 VOLUME = 234711.1150
> > Density = 1.0224
> >
> > ------------------------------------------------------------------------------
> >
> > check COM velocity, temp: 0.000048 0.00(Removed)
> > check COM velocity, temp: 0.000044 0.00(Removed)
> > check COM velocity, temp: 0.000034 0.00(Removed)
> > check COM velocity, temp: 0.000018 0.00(Removed)
> > ########################################
> >
> > I have not seen this before and am not sure whether it is normal or not.
> > If someone could clarify, it would be appreciated.
> >
> > 4) I put card0 back into the box on its own and ran 2x100ns of production
> > simulation of HIV-protease NPT with no issues. So I am pretty convinced
> > that this card is good.
> >
> >
> > With my 4x780 setup, 3 of the cards failed with errors and had NPT
> > determinism issues when they did not crash, which seems very bad luck
> > considering Ross tested a 4-GPU combo with no failures at all. I thought
> > that this might have something to do with the particular batches of the
> > card that have been produced at various times, so I checked all the serial
> > numbers printed on the hardware. The serial numbers, etc. were all the
> > same, but what was quite weird was that only the working card had a
> > distinctive stamp:
> >
> > "T7-E5"
> >
> > Probably it's nothing, but it would be interesting to know whether any
> > other owners of working Zotac 780's have this stamp or not.
> >
> > Going to RMA 3x Zotacs now and go to 680s.
> >
> >
> >
> >
> > On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
> >
> > > !ai caramba! :/
> > >
> > > it looks like 3 of the cards are consistently failing with skinnb errors
> > > on.....
> > >
> > >
> > > you guessed it:
> > >
> > > JAC NPT
> > >
> > > Have been running tests this weekend. Will post my findings later today
> > > or tomorrow.
> > >
> > >
> > > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
> > >
> > >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
> > >>
> > >> All the tests were reproducible across 2x repeats.
> > >>
> > >> Going to get a couple of more today.
> > >>
> > >> br,
> > >> g
> > >>
> > >>
> > >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
> > >>
> > >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
> > >>> 'em tomorrow. If they are good will order another 2.
> > >>>
> > >>> Thanks again for testing them.
> > >>>
> > >>>
> > >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
> > >>>
> > >>>> The GTX780s do not appear to be broken - we are just being cautious
> > >>>> right now.
> > >>>>
> > >>>> The Titans are broken for everyone right now - well, broken for anyone
> > >>>> who actually hits what they are broken with - which is still being
> > >>>> investigated. But certainly for anyone who uses cuFFT the Titans appear
> > >>>> to be broken right now.
> > >>>>
> > >>>> All the best
> > >>>> Ross
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
> > >>>>
> > >>>> >Are they "broken" only in terms of AMBER? Or could this be classed as a
> > >>>> >general hardware fault pertaining to all applications that use the
> > >>>> >card?
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >>>> >
> > >>>> >> It's not really a question of how it's programmed, it's a question of
> > >>>> >> manufacturing. One picks 12 out of 15 processor cores on the chip
> > >>>> >> itself to make a GTX 780, as opposed to picking 14 out of 15 processor
> > >>>> >> cores to make a GTX Titan. In the former, there are 455 ways to do so
> > >>>> >> and in the latter, 15.
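
The two counts quoted above are plain binomial coefficients and can be checked
in a couple of lines. This is just a sketch of the arithmetic; the function and
constant names are made up for the example, and it assumes Python 3.8 or later
for math.comb.

```python
from math import comb  # available from Python 3.8

TOTAL_SMX = 15  # SMX units on a full GK110 die, per the explanation above

def active_sets(active_smx, total_smx=TOTAL_SMX):
    """Number of distinct sets of SMXs that could be left active."""
    return comb(total_smx, active_smx)

print("GTX Titan (14 of 15 active):", active_sets(14))  # 15
print("GTX 780   (12 of 15 active):", active_sets(12))  # 455
```

Choosing which 12 units stay active is the same as choosing which 3 are
disabled, so comb(15, 12) and comb(15, 3) both give 455.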
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
> > >>>> >>
> > >>>> >> > Thanks very much for the quick information guys! It's much
> > >>>> >> > appreciated.
> > >>>> >> >
> > >>>> >> > I'm not that up on the manner in which these cards are programmed,
> > >>>> >> > so am a little confused by your explanation, Scott. Could you please
> > >>>> >> > clarify it for me?
> > >>>> >> >
> > >>>> >> > br,
> > >>>> >> > g
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >>>> >> >
> > >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan, one of
> > >>>> >> > > them is disabled. There are 15 (15 choose 1) ways to do this. All of
> > >>>> >> > > them seem to be broken.
> > >>>> >> > >
> > >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are
> > >>>> >> > > 455 (15 choose 3) ways to make one. I'm a little nervous that some of
> > >>>> >> > > those configurations may be broken, so the best thing to do is to
> > >>>> >> > > test if they exhibit deterministic behavior upon acquiring them, and
> > >>>> >> > > if they don't, RMA them as defective.
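
One way to run the determinism test described above is to repeat the same
simulation twice on the same card with identical input and a fixed random seed,
then compare the energies printed in the two mdout files. The sketch below is a
minimal illustration under those assumptions, not the procedure actually used
by anyone in this thread. The function names are made up, the parsing relies on
the NSTEP/Etot layout shown elsewhere in this thread, and the second sample run
is deliberately altered to show a mismatch.

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")

def etot_by_step(mdout_text):
    """Map each printed NSTEP to the first Etot string that follows it."""
    energies = {}
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
        match = ETOT_RE.search(line)
        if match and step is not None:
            # setdefault keeps the first record seen for a given step,
            # so a later summary block reusing that step is ignored.
            energies.setdefault(step, match.group(1))
    return energies

def first_divergence(run_a, run_b):
    """Return the first step printed in both runs whose Etot differs, else None."""
    a, b = etot_by_step(run_a), etot_by_step(run_b)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None

run_a = """\
 NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
 Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
 NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
 Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
"""
# Hypothetical second run: one energy changed by hand to simulate a mismatch.
run_b = run_a.replace("-58186.5909", "-58186.5911")

print(first_divergence(run_a, run_a))  # prints None
print(first_divergence(run_a, run_b))  # prints 50000
```

With real output one would read the two files into strings and pass them in.
Only steps present in both runs are compared, so a run that crashed part way
through needs to be flagged separately. The energies are compared as printed
strings because a deterministic card should reproduce them digit for digit.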
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> > >>>> >> > >
> > >>>> >> > > > Hi All,
> > >>>> >> > > >
> > >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing neither
> > >>>> >> > > > Scott nor myself have been able to break the GTX780s. This is in a
> > >>>> >> > > > 4 x GTX780 Exxact system, although at present we have only tested
> > >>>> >> > > > multiple single-GPU runs using all 4 GPUs at once - i.e. pmemd.cuda
> > >>>> >> > > > (NOT pmemd.cuda.MPI) - I will be testing pmemd.cuda.MPI shortly but
> > >>>> >> > > > I don't see why this wouldn't work given single GPU is working fine.
> > >>>> >> > > >
> > >>>> >> > > > Key though is that there are multiple ways to build GTX780s, and
> > >>>> >> > > > for now we have only tested one specific model, which is as follows:
> > >>>> >> > > >
> > >>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > > Until we have an opportunity to test different vendor GTX780s and
> > >>>> >> > > > OC versions, the advice is to stick with the above model if you can.
> > >>>> >> > > >
> > >>>> >> > > > All the best
> > >>>> >> > > > Ross
> > >>>> >> > > >
> > >>>> >> > > > /\
> > >>>> >> > > > \/
> > >>>> >> > > > |\oss Walker
> > >>>> >> > > >
> > >>>> >> > > > ---------------------------------------------------------
> > >>>> >> > > > | Associate Research Professor |
> > >>>> >> > > > | San Diego Supercomputer Center |
> > >>>> >> > > > | Adjunct Associate Professor |
> > >>>> >> > > > | Dept. of Chemistry and Biochemistry |
> > >>>> >> > > > | University of California San Diego |
> > >>>> >> > > > | NVIDIA Fellow |
> > >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> > >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> > >>>> >> > > > ---------------------------------------------------------
> > >>>> >> > > >
> > >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> > >>>> >> > > > may not be read every day, and should not be used for urgent or
> > >>>> >> > > > sensitive issues.
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > > _______________________________________________
> > >>>> >> > > > AMBER mailing list
> > >>>> >> > > > AMBER.ambermd.org
> > >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
> > >>>> >> > > >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 11 2013 - 12:00:02 PDT