So do you think that if I want trouble-free GPU runs, I'm best off returning
all the 780s, even the one that has not failed any of the benchmarks or
shown reproducibility errors?
Also, for my reference, are multiple "check COM velocity, temp" entries in
the mdout (after a run completes successfully) OK or not?
check COM velocity, temp:        0.000064     0.00(Removed)
check COM velocity, temp:        0.000035     0.00(Removed)
check COM velocity, temp:        0.000020     0.00(Removed)
check COM velocity, temp:        0.000031     0.00(Removed)
check COM velocity, temp:        0.000045     0.00(Removed)
check COM velocity, temp:        0.000023     0.00(Removed)
check COM velocity, temp:        0.000010     0.00(Removed)
check COM velocity, temp:        0.000022     0.00(Removed)
check COM velocity, temp:        0.000044     0.00(Removed)
check COM velocity, temp:        0.000047     0.00(Removed)
check COM velocity, temp:        0.000014     0.00(Removed)
check COM velocity, temp:        0.000032     0.00(Removed)
check COM velocity, temp:        0.000037     0.00(Removed)
many thanks
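
For the reproducibility question, a minimal Python sketch of how two mdout files from otherwise identical runs could be compared is given here. It assumes the energy records keep the NSTEP / Etot layout shown in the quoted output; the names energies_by_step and first_divergence are hypothetical helpers, not part of AMBER.

```python
import re

# Patterns for the "NSTEP = ..." header line and the "Etot = ..." line of a record.
NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")


def energies_by_step(mdout_text):
    """Return {nstep: Etot string} for each energy record in mdout-style text."""
    result = {}
    step = None
    for line in mdout_text.splitlines():
        match = NSTEP_RE.search(line)
        if match:
            step = int(match.group(1))
            continue
        match = ETOT_RE.search(line)
        if match and step is not None:
            # Keep the printed text so the comparison is exact, and keep only
            # the first record seen for a given step.
            result.setdefault(step, match.group(1))
            step = None
    return result


def first_divergence(run_a_text, run_b_text):
    """Return the first shared step whose Etot differs, or None if all agree."""
    a = energies_by_step(run_a_text)
    b = energies_by_step(run_b_text)
    for step in sorted(set(a) & set(b)):
        if a[step] != b[step]:
            return step
    return None
```

Reading each file with open(path).read() and passing the two strings to first_divergence would give the first printed step at which the runs disagree, or None if every shared record matches.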
On 11 July 2013 17:47, Scott Le Grand <varelse2005.gmail.com> wrote:
> So alas this doesn't surprise me at all.  The skinnb errors happen when the
> simulation goes to NaNaland in this case.  What this comes down to right
> now is that I don't trust *any* consumer GK110s.
>
>
>
>
> On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:
>
> > 1) Initially, I ran the full set of amber benchmarks at the standard settings
> > (100k steps)
> >
> > All the cards passed without issue
> >
> > 2) Increased nstlim to 200k steps
> >
> > one card outright crashed with the error: Nonbond cells need to be
> > recalculated, restart simulation from previous checkpoint
> > with a higher value for skinnb.
> >
> > reproducibility errors occurring in two other cards - in JAC NPT & Cellulose
> > NPT.
> >
> > 3) At this point I decided to concentrate on JAC NPT, as it is the largest
> > source of errors and nstlim can be extended without much of a time
> > penalty.
> > So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
> > with staggered start times to offset disk I/O.
> >
> > mdin:
> >    ntx=5, irest=1,
> >    ntc=2, ntf=2,
> >    nstlim=2500000,
> >    ntpr=25000, ntwx=25000,
> >    ntwr=250000,
> >    dt=0.002, cut=8.,
> >    ntt=1, tautp=10.0,
> >    temp0=300.0,
> >    ntb=2, ntp=1, taup=10.0,
> >    ioutfm=1,ig=43689,
> >
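
As a quick sanity check on the settings above, the following Python sketch works out how much simulated time and how many output records this mdin implies. The function and key names are illustrative only; the values are copied from the fragment.

```python
def run_summary(nstlim, dt_ps, ntpr, ntwx, ntwr):
    """Derive run length and output counts from mdin-style settings."""
    return {
        "total_ps": nstlim * dt_ps,           # simulated time in picoseconds
        "energy_records": nstlim // ntpr,     # energy blocks printed to mdout
        "trajectory_frames": nstlim // ntwx,  # frames written to the trajectory
        "restart_writes": nstlim // ntwr,     # periodic restart file updates
        "ps_per_record": ntpr * dt_ps,        # time between energy blocks
    }


summary = run_summary(
    nstlim=2_500_000, dt_ps=0.002, ntpr=25_000, ntwx=25_000, ntwr=250_000
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these settings the run covers about 5000 ps (5 ns), with 100 energy records spaced 50 ps apart, which agrees with the TIME(PS) values of 56.000 and 106.000 in the output quoted below. The 25 "check COM velocity" lines between those two records would then correspond to one line per 1000 steps.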
> >
> > The card in PCI slot 0 never failed. The other 3 cards (named after which
> > PCIe slot they occupied) always failed in the following order:
> >
> > card1 = within the first 10-20 mins
> > card2 = shortly after card1
> > card3 = takes a long time to fail. Almost gets to the end and sometimes
> > makes it
> >
> > The failure error was always a skinnb type error
> >
> > Obviously it was quite suspicious that only card0 in the primary PCIe slot
> > passed, and I thought it may have something to do with the switching
> > function on the plex chip interfering with things. So I took all the cards
> > out and tested them individually in PCIe 0. All of them failed with the
> > skinnb error. Additionally, every step in the mdout file is populated with:
> >
> > ########################################
> > check COM velocity, temp:        0.000028     0.00(Removed)
> > check COM velocity, temp:        0.000037     0.00(Removed)
> > check COM velocity, temp:        0.000032     0.00(Removed)
> > check COM velocity, temp:        0.000032     0.00(Removed)
> >
> >  NSTEP =    25000   TIME(PS) =      56.000  TEMP(K) =   300.60  PRESS =  -254.5
> >  Etot   =    -58129.2013  EKtot   =     14450.1387  EPtot      =    -72579.3399
> >  BOND   =       473.9411  ANGLE   =      1296.5580  DIHED      =       977.4736
> >  1-4 NB =       551.6041  1-4 EEL =      6656.6898  VDWAALS    =      8413.2767
> >  EELEC  =    -90948.8832  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> >  EKCMT  =      6304.1052  VIRIAL  =      7593.8283  VOLUME     =    234670.0662
> >                                                     Density    =         1.0226
> >
> >  ------------------------------------------------------------------------------
> >
> > check COM velocity, temp:        0.000064     0.00(Removed)
> > check COM velocity, temp:        0.000035     0.00(Removed)
> > check COM velocity, temp:        0.000020     0.00(Removed)
> > check COM velocity, temp:        0.000031     0.00(Removed)
> > check COM velocity, temp:        0.000045     0.00(Removed)
> > check COM velocity, temp:        0.000023     0.00(Removed)
> > check COM velocity, temp:        0.000010     0.00(Removed)
> > check COM velocity, temp:        0.000022     0.00(Removed)
> > check COM velocity, temp:        0.000044     0.00(Removed)
> > check COM velocity, temp:        0.000047     0.00(Removed)
> > check COM velocity, temp:        0.000014     0.00(Removed)
> > check COM velocity, temp:        0.000032     0.00(Removed)
> > check COM velocity, temp:        0.000037     0.00(Removed)
> > check COM velocity, temp:        0.000017     0.00(Removed)
> > check COM velocity, temp:        0.000040     0.00(Removed)
> > check COM velocity, temp:        0.000028     0.00(Removed)
> > check COM velocity, temp:        0.000032     0.00(Removed)
> > check COM velocity, temp:        0.000014     0.00(Removed)
> > check COM velocity, temp:        0.000030     0.00(Removed)
> > check COM velocity, temp:        0.000042     0.00(Removed)
> > check COM velocity, temp:        0.000036     0.00(Removed)
> > check COM velocity, temp:        0.000027     0.00(Removed)
> > check COM velocity, temp:        0.000040     0.00(Removed)
> > check COM velocity, temp:        0.000026     0.00(Removed)
> > check COM velocity, temp:        0.000053     0.00(Removed)
> >
> >  NSTEP =    50000   TIME(PS) =     106.000  TEMP(K) =   299.91  PRESS =    60.9
> >  Etot   =    -58186.5909  EKtot   =     14416.8516  EPtot      =    -72603.4424
> >  BOND   =       468.9608  ANGLE   =      1272.6458  DIHED      =      1000.8139
> >  1-4 NB =       554.1092  1-4 EEL =      6681.9525  VDWAALS    =      8584.4464
> >  EELEC  =    -91166.3710  EHBOND  =         0.0000  RESTRAINT  =         0.0000
> >  EKCMT  =      6287.7987  VIRIAL  =      5978.9791  VOLUME     =    234711.1150
> >                                                     Density    =         1.0224
> >
> >  ------------------------------------------------------------------------------
> >
> > check COM velocity, temp:        0.000048     0.00(Removed)
> > check COM velocity, temp:        0.000044     0.00(Removed)
> > check COM velocity, temp:        0.000034     0.00(Removed)
> > check COM velocity, temp:        0.000018     0.00(Removed)
> > ########################################
> >
> > I have not seen this before and am not sure whether this is normal or not.
> > If someone could clarify, it would be appreciated.
> >
> > 4) I put card0 back into the box on its own and ran 2x100ns of production
> > simulation of HIV-protease NPT with no issues, so I am pretty convinced that
> > this card is good.
> >
> >
> > With my 4x780 setup, 3 of the cards failed with errors and had NPT
> > determinism issues when they did not crash, which seems very bad luck
> > considering Ross tested a 4-GPU combo with no failures at all.  I thought
> > that this may have something to do with the particular batches of the card
> > that have been produced at various times, so I checked all the serial
> > numbers printed on the hardware. The serial numbers, etc. were all the same,
> > but what was quite weird was that only the card that was working had a
> > distinctive stamp:
> >
> > "T7-E5"
> >
> > Probably it's nothing, but it would be interesting to know whether any
> > other owners of working Zotac 780's have this stamp or not.
> >
> > Going to RMA 3x Zotacs now and go to 680s.
> >
> >
> >
> >
> > On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
> >
> > > !ai caramba! :/
> > >
> > > it looks like 3 of the cards are consistently failing with skinnb errors
> > > on.....
> > >
> > >
> > > you guessed it:
> > >
> > > JAC NPT
> > >
> > > Have been running tests this weekend. Will post my findings later today
> > > or tomorrow.
> > >
> > >
> > > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
> > >
> > >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
> > >>
> > >> All the tests were reproducible across 2x repeats.
> > >>
> > >> Going to get a couple of more today.
> > >>
> > >> br,
> > >> g
> > >>
> > >>
> > >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
> > >>
> > >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
> > >>> 'em tomorrow. If they are good will order another 2.
> > >>>
> > >>> Thanks again for testing them.
> > >>>
> > >>>
> > >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
> > >>>
> > >>>> The GTX780s do not appear to be broken - we are just being cautious
> > >>>> right now.
> > >>>>
> > >>>> The Titans are broken for everyone right now - well, broken for anyone
> > >>>> who actually hits what they are broken with - which is still being
> > >>>> investigated. But certainly for anyone who uses cuFFT the Titans appear
> > >>>> to be broken right now.
> > >>>>
> > >>>> All the best
> > >>>> Ross
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
> > >>>>
> > >>>> >Are they "broken" only in terms of AMBER? Or could this be classed as a
> > >>>> >general hardware fault pertaining to all applications that use the card?
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >>>> >
> > >>>> >> It's not really a question of how it's programmed, it's a question of
> > >>>> >> manufacturing.   One picks 12 out of 15 processor cores on the chip
> > >>>> >> itself to make a GTX 780 as opposed to picking 14 out of 15 processor
> > >>>> >> cores to make a GTX Titan.  In the former, there are 455 ways to do so
> > >>>> >> and in the latter, 15.
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >>
> > >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
> > >>>> >>
> > >>>> >> > Thanks very much for the quick information guys! It's much appreciated.
> > >>>> >> >
> > >>>> >> > I'm not that up on the manner in which these cards are programmed, so
> > >>>> >> > am a little confused by your explanation Scott. Could you please
> > >>>> >> > clarify it for me?
> > >>>> >> >
> > >>>> >> > br,
> > >>>> >> > g
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
> > >>>> >> >
> > >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU.  For GTX Titan, one of
> > >>>> >> > > them is disabled.  There are 15 (15 choose 1) ways to do this.  All of
> > >>>> >> > > them seem to be broken.
> > >>>> >> > >
> > >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are 455
> > >>>> >> > > (15 choose 3) ways to make one.  I'm a little nervous that some of those
> > >>>> >> > > configurations may be broken, so the best thing to do is to test if they
> > >>>> >> > > exhibit deterministic behavior upon acquiring them, and if they don't,
> > >>>> >> > > RMA them as defective.
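
The two counts quoted above can be reproduced with a short Python sketch. It only restates the arithmetic (choosing which SMXs are disabled out of 15); the function name disable_patterns is made up for illustration, and math.comb needs Python 3.8 or later.

```python
from math import comb

TOTAL_SMX = 15  # SMX units on a full GK110 die, as stated in the thread


def disable_patterns(total, active):
    """Number of distinct ways to choose which (total - active) units are off."""
    return comb(total, total - active)


titan_patterns = disable_patterns(TOTAL_SMX, 14)   # one SMX disabled
gtx780_patterns = disable_patterns(TOTAL_SMX, 12)  # three SMXs disabled

print("GTX Titan:", titan_patterns)
print("GTX 780:  ", gtx780_patterns)
```

This gives 15 patterns for the Titan and 455 for the 780, so any single tested 780 samples only one of many possible configurations.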
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > >
> > >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> > >>>> >> > >
> > >>>> >> > > > Hi All,
> > >>>> >> > > >
> > >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing neither
> > >>>> >> > > > Scott nor myself have been able to break the GTX780s. This is in a
> > >>>> >> > > > 4 x GTX780 Exxact system, although at present we have only tested
> > >>>> >> > > > multiple single GPU runs using all 4 GPUs at once - i.e. pmemd.cuda
> > >>>> >> > > > (NOT pmemd.cuda.MPI) - I will be testing pmemd.cuda.MPI shortly but I
> > >>>> >> > > > don't see why this wouldn't work given single GPU is working fine.
> > >>>> >> > > >
> > >>>> >> > > > Key though is that there are multiple ways to build GTX780s, and for
> > >>>> >> > > > now we have only tested one specific model which is as follows:
> > >>>> >> > > >
> > >>>> >> > > > http://tinyurl.com/prxlwy6   Zotac GTX780 ZT-70201-10P
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > > Until we have an opportunity to test different vendor GTX780s and OC
> > >>>> >> > > > versions the advice is to stick with the above model if you can.
> > >>>> >> > > >
> > >>>> >> > > > All the best
> > >>>> >> > > > Ross
> > >>>> >> > > >
> > >>>> >> > > > /\
> > >>>> >> > > > \/
> > >>>> >> > > > |\oss Walker
> > >>>> >> > > >
> > >>>> >> > > > ---------------------------------------------------------
> > >>>> >> > > > |             Associate Research Professor              |
> > >>>> >> > > > |            San Diego Supercomputer Center             |
> > >>>> >> > > > |             Adjunct Associate Professor               |
> > >>>> >> > > > |         Dept. of Chemistry and Biochemistry           |
> > >>>> >> > > > |          University of California San Diego           |
> > >>>> >> > > > |                     NVIDIA Fellow                     |
> > >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org  |
> > >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk  |
> > >>>> >> > > > ---------------------------------------------------------
> > >>>> >> > > >
> > >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> > >>>> >> > > > may not be read every day, and should not be used for urgent or
> > >>>> >> > > > sensitive issues.
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > >
> > >>>> >> > > > _______________________________________________
> > >>>> >> > > > AMBER mailing list
> > >>>> >> > > > AMBER.ambermd.org
> > >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jul 11 2013 - 12:00:02 PDT