Re: [AMBER] GTX780 News

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 11 Jul 2013 09:47:54 -0700

So alas this doesn't surprise me at all. The skinnb errors happen when the
simulation goes to NaNaland in this case. What this comes down to right
now is that I don't trust *any* consumer GK110s.




On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:

> 1) Initially, I ran the full set of AMBER benchmarks at the standard settings
> (100k steps)
>
> All the cards passed without issue
>
> 2) Increased nstlim to 200k steps
>
> one card outright crashed with the error: Nonbond cells need to be
> recalculated, restart simulation from previous checkpoint
> with a higher value for skinnb.
>
> Reproducibility errors occurred on two other cards, in JAC NPT & Cellulose
> NPT.
>
> 3) At this point decided to concentrate on JAC NPT as it is the largest
> source of errors and nstlim can be extended without that much of a time
> penalty.
> So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
> with staggered start times to offset disk I/O.
>
> mdin:
> ntx=5, irest=1,
> ntc=2, ntf=2,
> nstlim=2500000,
> ntpr=25000, ntwx=25000,
> ntwr=250000,
> dt=0.002, cut=8.,
> ntt=1, tautp=10.0,
> temp0=300.0,
> ntb=2, ntp=1, taup=10.0,
> ioutfm=1,ig=43689,
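
For reference, the run length implied by these settings can be worked out directly. This is a minimal sketch that only uses the values quoted in the mdin fragment above; the variable names are illustrative and not part of AMBER.

```python
# Values copied from the mdin fragment quoted above.
nstlim = 2_500_000  # number of MD steps
dt_ps = 0.002       # time step in picoseconds
ntpr = 25_000       # steps between energy records in mdout
ntwx = 25_000       # steps between trajectory frames
ntwr = 250_000      # steps between restart file writes

total_ps = nstlim * dt_ps        # total simulated time in ps
total_ns = total_ps / 1000.0     # same, in ns
ps_per_print = ntpr * dt_ps      # simulated time between energy records
n_prints = nstlim // ntpr        # energy records expected in a complete run
n_frames = nstlim // ntwx        # trajectory frames expected
n_restarts = nstlim // ntwr      # restart writes expected

print(f"simulated time      : {total_ps:.0f} ps ({total_ns:.0f} ns)")
print(f"time between records: {ps_per_print:.0f} ps")
print(f"energy records      : {n_prints}")
print(f"trajectory frames   : {n_frames}")
print(f"restart writes      : {n_restarts}")
```

So a complete run covers 5 ns, with an energy record every 50 ps. That spacing matches the mdout excerpt quoted in this message, where consecutive records sit at TIME(PS) = 56.000 and 106.000.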
>
>
> The card in PCIe slot 0 never failed. The other 3 cards (named after which
> PCIe slot they occupied) always failed in the following order:
>
> card1 = within the first 10-20 mins
> card2 = shortly after card1
> card3 = takes a long time to fail. Almost gets to the end and sometimes
> makes it
>
> The failure error was always a skinnb type error
>
> Obviously it was quite suspicious that only card0 in the primary PCIe slot
> passed and I thought it may have something to do with the switching
> function on the PLX chip interfering with things. So I took all the cards
> out and tested them individually in PCIe 0. All of them failed with the
> skinnb error. Additionally, every step in the mdout file is populated with:
>
> ########################################
> check COM velocity, temp: 0.000028 0.00(Removed)
> check COM velocity, temp: 0.000037 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
>
> NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
> Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
> BOND = 473.9411 ANGLE = 1296.5580 DIHED = 977.4736
> 1-4 NB = 551.6041 1-4 EEL = 6656.6898 VDWAALS = 8413.2767
> EELEC = -90948.8832 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 6304.1052 VIRIAL = 7593.8283 VOLUME = 234670.0662
> Density = 1.0226
>
> ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000064 0.00(Removed)
> check COM velocity, temp: 0.000035 0.00(Removed)
> check COM velocity, temp: 0.000020 0.00(Removed)
> check COM velocity, temp: 0.000031 0.00(Removed)
> check COM velocity, temp: 0.000045 0.00(Removed)
> check COM velocity, temp: 0.000023 0.00(Removed)
> check COM velocity, temp: 0.000010 0.00(Removed)
> check COM velocity, temp: 0.000022 0.00(Removed)
> check COM velocity, temp: 0.000044 0.00(Removed)
> check COM velocity, temp: 0.000047 0.00(Removed)
> check COM velocity, temp: 0.000014 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000037 0.00(Removed)
> check COM velocity, temp: 0.000017 0.00(Removed)
> check COM velocity, temp: 0.000040 0.00(Removed)
> check COM velocity, temp: 0.000028 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000014 0.00(Removed)
> check COM velocity, temp: 0.000030 0.00(Removed)
> check COM velocity, temp: 0.000042 0.00(Removed)
> check COM velocity, temp: 0.000036 0.00(Removed)
> check COM velocity, temp: 0.000027 0.00(Removed)
> check COM velocity, temp: 0.000040 0.00(Removed)
> check COM velocity, temp: 0.000026 0.00(Removed)
> check COM velocity, temp: 0.000053 0.00(Removed)
>
> NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
> Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
> BOND = 468.9608 ANGLE = 1272.6458 DIHED = 1000.8139
> 1-4 NB = 554.1092 1-4 EEL = 6681.9525 VDWAALS = 8584.4464
> EELEC = -91166.3710 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 6287.7987 VIRIAL = 5978.9791 VOLUME = 234711.1150
> Density = 1.0224
>
> ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000048 0.00(Removed)
> check COM velocity, temp: 0.000044 0.00(Removed)
> check COM velocity, temp: 0.000034 0.00(Removed)
> check COM velocity, temp: 0.000018 0.00(Removed)
> ########################################
>
> I have not seen this before, am not sure whether this is normal or not. If
> someone could clarify it would be appreciated.
>
> 4) I put card0 back into the box on its own and I ran 2x100ns of production
> simulation of HIV-protease NPT with no issues. So am pretty convinced that
> this card is good.
>
>
> With my 4x780 setup 3 of the cards failed with errors and had NPT
> determinism issues when they did not crash, which seems very bad luck
> considering Ross tested a 4-GPU combo with no failures at all. I thought
> that this may have something to do with the particular batches of the card
> that have been produced at various times. So I checked all the serial
> numbers printed on the hardware. The serial numbers, etc were all the same,
> but what was quite weird was that only the card that was working had a
> distinctive stamp:
>
> "T7-E5"
>
> Probably it's nothing, but it would be interesting to know whether any
> other owners of working Zotac 780's have this stamp or not.
>
> Going to RMA 3x Zotacs now and go to 680s.
>
>
>
>
> On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
>
> > !ai caramba! :/
> >
> > it looks like 3 of the cards are consistently failing with skinnb errors
> > on.....
> >
> >
> > you guessed it:
> >
> > JAC NPT
> >
> > Have been running tests this weekend. Will post my findings later today
> > or tomorrow.
> >
> >
> > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
> >
> >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
> >>
> >> All the tests were reproducible across 2x repeats.
> >>
> >> Going to get a couple of more today.
> >>
> >> br,
> >> g
> >>
> >>
> >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
> >>
> >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
> >>> 'em tomorrow. If they are good will order another 2.
> >>>
> >>> Thanks again for testing them.
> >>>
> >>>
> >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
> >>>
> >>>> The GTX780s do not appear to be broken - we are just being cautious
> >>>> right now.
> >>>>
> >>>> The Titans are broken for everyone right now - well, broken for anyone
> >>>> who actually hits what they are broken with - which is still being
> >>>> investigated. But certainly for anyone who uses cuFFT the Titans
> >>>> appear to be broken right now.
> >>>>
> >>>> All the best
> >>>> Ross
> >>>>
> >>>>
> >>>>
> >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
> >>>>
> >>>> >Are they "broken" only in terms of AMBER? Or could this be classed as
> >>>> >a general hardware fault pertaining to all applications that use the
> >>>> >card?
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
> >>>> >
> >>>> >> It's not really a question of how it's programmed, it's a question
> >>>> >> of manufacturing. One picks 12 out of 15 processor cores on the chip
> >>>> >> itself to make a GTX 780 as opposed to picking 14 out of 15 processor
> >>>> >> cores to make a GTX Titan. In the former, there are 455 ways to do
> >>>> >> so and in the latter, 15.
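
The two counts quoted here are plain binomial coefficients, and they are easy to check. A minimal sketch, assuming only the figures given in the thread (15 SMXs per GK110 die, 12 active on a GTX 780, 14 on a GTX Titan); the function name is illustrative.

```python
from math import comb

TOTAL_SMX = 15  # SMX units on a full GK110 die, as stated in the thread


def configurations(active_smx: int, total_smx: int = TOTAL_SMX) -> int:
    """Count the distinct ways to choose which SMX units stay enabled."""
    return comb(total_smx, active_smx)


gtx_780 = configurations(12)    # 12 of 15 enabled, i.e. 3 disabled
gtx_titan = configurations(14)  # 14 of 15 enabled, i.e. 1 disabled

print("GTX 780 configurations  :", gtx_780)
print("GTX Titan configurations:", gtx_titan)
```

Choosing 12 units to keep is the same as choosing 3 to disable, which is why the count can be written either as "15 choose 12" or "15 choose 3".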
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
> >>>> >>
> >>>> >> > Thanks very much for the quick information guys! It's much
> >>>> >> > appreciated.
> >>>> >> >
> >>>> >> > I'm not that up on the manner in which these cards are programmed,
> >>>> >> > so am a little confused by your explanation Scott. Could you please
> >>>> >> > clarify it for me?
> >>>> >> >
> >>>> >> > br,
> >>>> >> > g
> >>>> >> >
> >>>> >> >
> >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com>
> >>>> wrote:
> >>>> >> >
> >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan, one of
> >>>> >> > > them is disabled. There are 15 (15 choose 1) ways to do this. All of
> >>>> >> > > them seem to be broken.
> >>>> >> > >
> >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are
> >>>> >> > > 455 (15 choose 3) ways to make one. I'm a little nervous that some of
> >>>> >> > > those configurations may be broken, so the best thing to do is to test
> >>>> >> > > if they exhibit deterministic behavior upon acquiring them, and if
> >>>> >> > > they don't, RMA them as defective.
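
One way to run the determinism test described here is to run the same benchmark twice on the same card, with the same input and seed, and compare the energies printed in the two mdout files. The following is a minimal sketch, not an official AMBER tool: it assumes energy records laid out as in the mdout excerpts quoted in this thread (an `NSTEP = ...` field followed by `Etot = ...`), and the function names are hypothetical.

```python
import re

# One energy record: the NSTEP value, then the first Etot value after it.
# Etot is captured as raw text so that NaN or ******** also shows up as a difference.
RECORD = re.compile(r"NSTEP\s*=\s*(\d+).*?Etot\s*=\s*(\S+)", re.DOTALL)


def etot_by_step(mdout_text: str) -> dict:
    """Map NSTEP -> printed Etot string, keeping the first record seen per step."""
    records = {}
    for step, etot in RECORD.findall(mdout_text):
        # setdefault keeps the per-step record if a later summary block repeats NSTEP
        records.setdefault(int(step), etot)
    return records


def first_divergence(run_a: str, run_b: str):
    """Return the first NSTEP whose printed Etot differs, or None if all shared steps agree."""
    a, b = etot_by_step(run_a), etot_by_step(run_b)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None


def compare_files(path_a: str, path_b: str):
    """Convenience wrapper: compare two mdout files on disk."""
    with open(path_a) as fa, open(path_b) as fb:
        return first_divergence(fa.read(), fb.read())
```

A call such as `compare_files("run1/mdout", "run2/mdout")` (paths are placeholders) returns `None` when every step present in both runs prints the same total energy. Only steps present in both files are compared, so a run that crashed early has to be detected separately, for example by comparing the number of records each file contains.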
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <
> >>>> ross.rosswalker.co.uk>
> >>>> >> > > wrote:
> >>>> >> > >
> >>>> >> > > > Hi All,
> >>>> >> > > >
> >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing neither
> >>>> >> > > > Scott nor myself have been able to break the GTX780s. This is in a
> >>>> >> > > > 4 x GTX780 Exxact system although at present we have only tested
> >>>> >> > > > multiple single GPU runs using all 4 GPUs at once - I.e. pmemd.cuda
> >>>> >> > > > (NOT pmemd.cuda.MPI) - I will be testing pmemd.cuda.MPI shortly but
> >>>> >> > > > I don't see why this wouldn't work given single GPU is working fine.
> >>>> >> > > >
> >>>> >> > > > Key though is that there are multiple ways to build GTX780s, and
> >>>> >> > > > for now we have only tested one specific model which is as follows:
> >>>> >> > > >
> >>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > > Until we have an opportunity to test different vendor GTX780s and
> >>>> >> > > > OC versions the advice is to stick with the above model if you can.
> >>>> >> > > >
> >>>> >> > > > All the best
> >>>> >> > > > Ross
> >>>> >> > > >
> >>>> >> > > > /\
> >>>> >> > > > \/
> >>>> >> > > > |\oss Walker
> >>>> >> > > >
> >>>> >> > > > ---------------------------------------------------------
> >>>> >> > > > | Associate Research Professor |
> >>>> >> > > > | San Diego Supercomputer Center |
> >>>> >> > > > | Adjunct Associate Professor |
> >>>> >> > > > | Dept. of Chemistry and Biochemistry |
> >>>> >> > > > | University of California San Diego |
> >>>> >> > > > | NVIDIA Fellow |
> >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >>>> >> > > > ---------------------------------------------------------
> >>>> >> > > >
> >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> >>>> >> > > > may not be read every day, and should not be used for urgent or
> >>>> >> > > > sensitive issues.
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > > _______________________________________________
> >>>> >> > > > AMBER mailing list
> >>>> >> > > > AMBER.ambermd.org
> >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
> >>>> >> > > >
Received on Thu Jul 11 2013 - 10:00:04 PDT