So alas this doesn't surprise me at all. The skinnb errors happen when the
simulation goes to NaNaland in this case. What this comes down to right
now is that I don't trust *any* consumer GK110s.
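
For anyone who wants to find where a run first went bad, here is a minimal sketch, not part of AMBER. It scans mdout text for the first energy record whose printed values are not finite. The helper name first_bad_step and the label list are hypothetical (the labels are copied from the records quoted below), and it assumes bad values are printed as NaN, Inf, or a run of asterisks.

```python
import math
import re

# "LABEL = value" pairs as printed in mdout energy records; the optional
# "1-4 " prefix covers the "1-4 NB" and "1-4 EEL" labels.
FIELD = re.compile(r"((?:1-4 )?[A-Za-z][A-Za-z()]*)\s*=\s*(\S+)")

# Labels copied from the energy records quoted in this thread (assumption:
# these are the only fields worth checking).
ENERGY_LABELS = {
    "TEMP(K)", "PRESS", "Etot", "EKtot", "EPtot", "BOND", "ANGLE", "DIHED",
    "1-4 NB", "1-4 EEL", "VDWAALS", "EELEC", "EHBOND", "RESTRAINT",
    "EKCMT", "VIRIAL", "VOLUME", "Density",
}

def first_bad_step(mdout_text):
    """Return the NSTEP of the first record with a non-finite value, else None."""
    step = None
    for label, raw in FIELD.findall(mdout_text):
        if label == "NSTEP":
            step = int(raw)
        elif step is not None and label in ENERGY_LABELS:
            try:
                finite = math.isfinite(float(raw))
            except ValueError:
                # Not parseable as a number, e.g. "********" on field overflow.
                finite = False
            if not finite:
                return step
    return None
```

Usage would be along the lines of first_bad_step(open("mdout").read()), where the file name is whatever you passed to -o. A return value of None only means no NaN was printed, not that the card is good.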
On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:
> 1) Initially, I ran the full set of AMBER benchmarks at the standard settings
> (100k steps)
>
> All the cards passed without issue
>
> 2) Increased nstlim to 200k steps
>
> one card outright crashed with the error: Nonbond cells need to be
> recalculated, restart simulation from previous checkpoint
> with a higher value for skinnb.
>
> reproducibility errors occurring in two other cards - in JAC NPT & Cellulose
> NPT.
>
> 3) At this point I decided to concentrate on JAC NPT as it is the largest
> source of errors and nstlim can be extended without that much of a time
> penalty.
> So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
> with staggered start times to offset disk I/O.
>
> mdin:
> ntx=5, irest=1,
> ntc=2, ntf=2,
> nstlim=2500000,
> ntpr=25000, ntwx=25000,
> ntwr=250000,
> dt=0.002, cut=8.,
> ntt=1, tautp=10.0,
> temp0=300.0,
> ntb=2, ntp=1, taup=10.0,
> ioutfm=1,ig=43689,
>
>
> The card in PCI slot 0 never failed. The other 3 cards (named after which
> PCIe slot they occupied) always failed in the following order:
>
> card1 = within the first 10-20 mins
> card2 = shortly after card1
> card3 = takes a long time to fail. Almost gets to the end and sometimes
> makes it
>
> The failure error was always a skinnb type error
>
> Obviously it was quite suspicious that only card0 in the primary PCIe slot
> passed and I thought it may have something to do with the switching
> function on the plex chip interfering with things. So I took all the cards
> out and tested them individually in PCIe 0. All of them failed with the
> skinnb error. Additionally, every step in the mdout file is populated with:
>
> ########################################
> check COM velocity, temp: 0.000028 0.00(Removed)
> check COM velocity, temp: 0.000037 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
>
> NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
> Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
> BOND = 473.9411 ANGLE = 1296.5580 DIHED = 977.4736
> 1-4 NB = 551.6041 1-4 EEL = 6656.6898 VDWAALS = 8413.2767
> EELEC = -90948.8832 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 6304.1052 VIRIAL = 7593.8283 VOLUME = 234670.0662
> Density = 1.0226
>
> ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000064 0.00(Removed)
> check COM velocity, temp: 0.000035 0.00(Removed)
> check COM velocity, temp: 0.000020 0.00(Removed)
> check COM velocity, temp: 0.000031 0.00(Removed)
> check COM velocity, temp: 0.000045 0.00(Removed)
> check COM velocity, temp: 0.000023 0.00(Removed)
> check COM velocity, temp: 0.000010 0.00(Removed)
> check COM velocity, temp: 0.000022 0.00(Removed)
> check COM velocity, temp: 0.000044 0.00(Removed)
> check COM velocity, temp: 0.000047 0.00(Removed)
> check COM velocity, temp: 0.000014 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000037 0.00(Removed)
> check COM velocity, temp: 0.000017 0.00(Removed)
> check COM velocity, temp: 0.000040 0.00(Removed)
> check COM velocity, temp: 0.000028 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000014 0.00(Removed)
> check COM velocity, temp: 0.000030 0.00(Removed)
> check COM velocity, temp: 0.000042 0.00(Removed)
> check COM velocity, temp: 0.000036 0.00(Removed)
> check COM velocity, temp: 0.000027 0.00(Removed)
> check COM velocity, temp: 0.000040 0.00(Removed)
> check COM velocity, temp: 0.000026 0.00(Removed)
> check COM velocity, temp: 0.000053 0.00(Removed)
>
> NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
> Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
> BOND = 468.9608 ANGLE = 1272.6458 DIHED = 1000.8139
> 1-4 NB = 554.1092 1-4 EEL = 6681.9525 VDWAALS = 8584.4464
> EELEC = -91166.3710 EHBOND = 0.0000 RESTRAINT = 0.0000
> EKCMT = 6287.7987 VIRIAL = 5978.9791 VOLUME = 234711.1150
> Density = 1.0224
>
> ------------------------------------------------------------------------------
>
> check COM velocity, temp: 0.000048 0.00(Removed)
> check COM velocity, temp: 0.000044 0.00(Removed)
> check COM velocity, temp: 0.000034 0.00(Removed)
> check COM velocity, temp: 0.000018 0.00(Removed)
> ########################################
>
> I have not seen this before and am not sure whether it is normal. If
> someone could clarify, it would be appreciated.
>
> 4) I put card0 back into the box on its own and I ran 2x100ns of production
> simulation of HIV-protease NPT with no issues. So am pretty convinced that
> this card is good.
>
>
> With my 4x780 setup 3 of the cards failed with errors and had NPT
> deterministic issues when they did not crash, which seems very bad luck
> considering Ross tested a 4-GPU combo with no failures at all. I thought
> that this may have something to do with the particular batches of the card
> that have been produced at various times. So I checked all the serial
> numbers printed on the hardware. The serial numbers, etc. were all the same,
> but what was quite weird was that only the card that was working had a
> distinctive stamp:
>
> "T7-E5"
>
> Probably it's nothing, but it would be interesting to know whether any
> other owners of working Zotac 780's have this stamp or not.
>
> Going to RMA 3x Zotacs now and go to 680s.
>
>
>
>
> On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
>
> > !ai caramba! :/
> >
> > it looks like 3 of the cards are consistently failing with skinnb errors
> > on.....
> >
> >
> > you guessed it:
> >
> > JAC NPT
> >
> > Have been running tests this weekend. Will post my findings later today or
> > tomorrow.
> >
> >
> > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
> >
> >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
> >>
> >> All the tests were reproducible across 2x repeats.
> >>
> >> Going to get a couple more today.
> >>
> >> br,
> >> g
> >>
> >>
> >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
> >>
> >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
> >>> 'em tomorrow. If they are good will order another 2.
> >>>
> >>> Thanks again for testing them.
> >>>
> >>>
> >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
> >>>
> >>>> The GTX780s do not appear to be broken - we are just being cautious
> >>>> right now.
> >>>>
> >>>> The Titans are broken for everyone right now - well, broken for anyone
> >>>> who actually hits what they are broken with - which is still being
> >>>> investigated. But certainly for anyone who uses cuFFT the Titans
> >>>> appear to be broken right now.
> >>>>
> >>>> All the best
> >>>> Ross
> >>>>
> >>>>
> >>>>
> >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
> >>>>
> >>>> >Are they "broken" only in terms of AMBER? Or could this be classed as a
> >>>> >general hardware fault pertaining to all applications that use the
> >>>> >card?
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
> >>>> >
> >>>> >> It's not really a question of how it's programmed, it's a question of
> >>>> >> manufacturing. One picks 12 out of 15 processor cores on the chip
> >>>> >> itself to make a GTX 780 as opposed to picking 14 out of 15 processor
> >>>> >> cores to make a GTX Titan. In the former, there are 455 ways to do so
> >>>> >> and in the latter, 15.
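
The two counts quoted above are plain binomial coefficients and can be checked in a couple of lines. This is only an arithmetic sketch: the function name disable_patterns is made up for illustration, and math.comb needs Python 3.8 or later.

```python
from math import comb

TOTAL_SMX = 15  # SMX units on a full GK110 die, per the thread

def disable_patterns(active):
    """Number of distinct ways to choose which SMXs stay active."""
    return comb(TOTAL_SMX, active)

print(disable_patterns(14))  # GTX Titan: 14 of 15 active -> 15
print(disable_patterns(12))  # GTX 780: 12 of 15 active -> 455
```

Choosing 12 active SMXs is the same as choosing 3 to disable, which is why the 780 figure is 15 choose 3.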
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
> >>>> >>
> >>>> >> > Thanks very much for the quick information guys! It's much
> >>>> >> > appreciated.
> >>>> >> >
> >>>> >> > I'm not that up on the manner in which these cards are programmed,
> >>>> >> > so am a little confused by your explanation Scott. Could you please
> >>>> >> > clarify it for me?
> >>>> >> >
> >>>> >> > br,
> >>>> >> > g
> >>>> >> >
> >>>> >> >
> >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
> >>>> >> >
> >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan, one of
> >>>> >> > > them is disabled. There are 15 (15 choose 1) ways to do this. All of
> >>>> >> > > them seem to be broken.
> >>>> >> > >
> >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are
> >>>> >> > > 455 (15 choose 3) ways to make one. I'm a little nervous that some of
> >>>> >> > > those configurations may be broken, so the best thing to do is to
> >>>> >> > > test if they exhibit deterministic behavior upon acquiring them, and
> >>>> >> > > if they don't, RMA them as defective.
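
One way to run the determinism test described above is to run the same input twice on the same card and compare the printed energies step by step. The following is a rough sketch under stated assumptions, not an AMBER tool: the helper names energy_records and first_divergence are hypothetical, only Etot, EKtot and EPtot are compared, and any difference in the printed digits counts as a mismatch.

```python
import re

# "LABEL = value" pairs as printed in mdout energy records.
FIELD = re.compile(r"((?:1-4 )?[A-Za-z][A-Za-z()]*)\s*=\s*(\S+)")

# Assumption: comparing the three total-energy fields is enough to spot drift.
COMPARED = {"Etot", "EKtot", "EPtot"}

def energy_records(mdout_text):
    """Map NSTEP -> {label: printed value} for the compared labels."""
    records = {}
    current = None
    for label, raw in FIELD.findall(mdout_text):
        if label == "NSTEP":
            current = records.setdefault(int(raw), {})
        elif current is not None and label in COMPARED:
            # Keep the first value seen for a step, so a later summary
            # section that repeats the final NSTEP does not overwrite it.
            current.setdefault(label, raw)
    return records

def first_divergence(run_a_text, run_b_text):
    """Return the first shared NSTEP where the two runs differ, else None."""
    a = energy_records(run_a_text)
    b = energy_records(run_b_text)
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None
```

Both runs need the same input and the same fixed ig seed for the comparison to mean anything. A card that returns None here has passed this particular check only; it says nothing about runs longer than the ones compared.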
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > >
> >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >>>> >> > >
> >>>> >> > > > Hi All,
> >>>> >> > > >
> >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing neither
> >>>> >> > > > Scott nor myself have been able to break the GTX780s. This is in a
> >>>> >> > > > 4 x GTX780 Exxact system although at present we have only tested
> >>>> >> > > > multiple single GPU runs using all 4 GPUs at once - i.e. pmemd.cuda
> >>>> >> > > > (NOT pmemd.cuda.MPI) - I will be testing pmemd.cuda.MPI shortly but
> >>>> >> > > > I don't see why this wouldn't work given single GPU is working fine.
> >>>> >> > > >
> >>>> >> > > > Key though is that there are multiple ways to build GTX780s, and
> >>>> >> > > > for now we have only tested one specific model which is as follows:
> >>>> >> > > >
> >>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > > Until we have an opportunity to test different vendor GTX780s and
> >>>> >> > > > OC versions the advice is to stick with the above model if you can.
> >>>> >> > > >
> >>>> >> > > > All the best
> >>>> >> > > > Ross
> >>>> >> > > >
> >>>> >> > > > /\
> >>>> >> > > > \/
> >>>> >> > > > |\oss Walker
> >>>> >> > > >
> >>>> >> > > > ---------------------------------------------------------
> >>>> >> > > > | Associate Research Professor |
> >>>> >> > > > | San Diego Supercomputer Center |
> >>>> >> > > > | Adjunct Associate Professor |
> >>>> >> > > > | Dept. of Chemistry and Biochemistry |
> >>>> >> > > > | University of California San Diego |
> >>>> >> > > > | NVIDIA Fellow |
> >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
> >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> >>>> >> > > > ---------------------------------------------------------
> >>>> >> > > >
> >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> >>>> >> > > > may not be read every day, and should not be used for urgent or
> >>>> >> > > > sensitive issues.
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > >
> >>>> >> > > > _______________________________________________
> >>>> >> > > > AMBER mailing list
> >>>> >> > > > AMBER.ambermd.org
> >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
> >>>> >> > > >
Received on Thu Jul 11 2013 - 10:00:04 PDT