Re: [AMBER] GTX780 News

From: ET <sketchfoot.gmail.com>
Date: Fri, 12 Jul 2013 02:58:32 +0100

So it appears that Zotac does have two 780 card bundles:

http://www.zotac.com/products/graphics-cards/geforce-700-series/gtx-780/product/gtx-780/category/geforce-700-series/main-category/graphics-cards/sort/product_name/order/DESC/amount/10/discontinued/1.html

There are:

1) ZT-70201-10P
2) ZT-70202-10P

On casual inspection (the compare feature doesn't seem to work and it's
late), it seems to me that the only difference is whether a game is
included.

I know that my invoice for the 4x 780s was for ZT-70201-10P cards only. However,
the fact that only the one working card had the characteristic stamp ("T7-E5")
is bugging me.

Ross, could you please confirm whether the working 780 you performed the
extensive tests on had this stamp?

much obliged. :)


On 11 July 2013 19:46, ET <sketchfoot.gmail.com> wrote:

> so do you think that if I want trouble-free GPU runs I'm best off returning
> all the 780s, even the one that has not failed any of the benchmarks or
> shown reproducibility errors?
>
> Also, for my reference, are multiple "check COM velocity, temp" entries in
> the mdout (on successful completion of a run) OK or not?
>
>
> check COM velocity, temp: 0.000064 0.00(Removed)
> check COM velocity, temp: 0.000035 0.00(Removed)
> check COM velocity, temp: 0.000020 0.00(Removed)
> check COM velocity, temp: 0.000031 0.00(Removed)
> check COM velocity, temp: 0.000045 0.00(Removed)
> check COM velocity, temp: 0.000023 0.00(Removed)
> check COM velocity, temp: 0.000010 0.00(Removed)
> check COM velocity, temp: 0.000022 0.00(Removed)
> check COM velocity, temp: 0.000044 0.00(Removed)
> check COM velocity, temp: 0.000047 0.00(Removed)
> check COM velocity, temp: 0.000014 0.00(Removed)
> check COM velocity, temp: 0.000032 0.00(Removed)
> check COM velocity, temp: 0.000037 0.00(Removed)
>
> many thanks
>
>
> On 11 July 2013 17:47, Scott Le Grand <varelse2005.gmail.com> wrote:
>
>> So alas this doesn't surprise me at all. The skinnb errors happen when the
>> simulation goes to NaNaland in this case. What this comes down to right
>> now is that I don't trust *any* consumer GK110s.
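Scott's point that the skinnb errors follow the simulation going to NaN suggests a quick check on a failed run: find the first energy record in mdout that holds a non-finite value. The sketch below is a hypothetical helper (`first_bad_step` is not part of Amber). It assumes energy records begin with an `NSTEP =` line and end at a blank or dashed line, and that a blown-up field is printed as `NaN`, `Infinity`, or a run of Fortran overflow asterisks; the sample records are made up for illustration.

```python
import math


def first_bad_step(mdout_text):
    """Return the NSTEP of the first energy record holding a NaN, an infinity,
    or an unparseable field (e.g. Fortran overflow asterisks), else None."""
    in_record = False
    step = None
    for line in mdout_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("NSTEP"):
            in_record = True          # an energy record starts here
        elif not stripped or stripped.startswith("---"):
            in_record = False         # blank line or separator ends it
            continue
        if not in_record:
            continue                  # skip mdin echo, COM lines, timings
        # The first token after each "=" is that field's value.
        values = [chunk.split()[0] for chunk in line.split("=")[1:] if chunk.split()]
        if stripped.startswith("NSTEP"):
            step = int(values[0])
        for v in values:
            try:
                x = float(v)
            except ValueError:
                return step           # e.g. "************"
            if not math.isfinite(x):
                return step           # "NaN" or "Infinity"
    return None


# Made-up records: one healthy, then one that has blown up.
healthy = (" NSTEP =    25000   TIME(PS) =      56.000  TEMP(K) =   300.60  PRESS =  -254.5\n"
           " Etot   =    -58129.2013  EKtot   =     14450.1387  EPtot      =    -72579.3399\n")
blown_up = healthy + ("\n NSTEP =    50000   TIME(PS) =     106.000  TEMP(K) =      NaN  PRESS =     NaN\n"
                      " Etot   =            NaN  EKtot   =   ************  EPtot      =            NaN\n")
print(first_bad_step(healthy), first_bad_step(blown_up))  # None 50000
```

Only energy records are scanned, so the echoed mdin (which also contains `=` signs) cannot trigger a false positive.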
>>
>>
>>
>>
>> On Thu, Jul 11, 2013 at 9:23 AM, ET <sketchfoot.gmail.com> wrote:
>>
>> > 1) Initially, I ran the full set of Amber benchmarks at the standard
>> > settings (100k steps).
>> >
>> > All the cards passed without issue
>> >
>> > 2) Increased nstlim to 200k steps
>> >
>> > one card outright crashed with the error: Nonbond cells need to be
>> > recalculated, restart simulation from previous checkpoint
>> > with a higher value for skinnb.
>> >
>> > Reproducibility errors occurred on two other cards, in JAC NPT and
>> > Cellulose NPT.
>> >
>> > 3) At this point I decided to concentrate on JAC NPT, as it is the largest
>> > source of errors and nstlim can be extended without much of a time
>> > penalty.
>> > So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
>> > with staggered start times to offset disk I/O.
>> >
>> > mdin:
>> > ntx=5, irest=1,
>> > ntc=2, ntf=2,
>> > nstlim=2500000,
>> > ntpr=25000, ntwx=25000,
>> > ntwr=250000,
>> > dt=0.002, cut=8.,
>> > ntt=1, tautp=10.0,
>> > temp0=300.0,
>> > ntb=2, ntp=1, taup=10.0,
>> > ioutfm=1,ig=43689,
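As a sanity check on what this mdin asks of each card, the run length and output counts follow directly from the values above (dt is in ps). This is plain arithmetic, not an Amber tool; the variable names simply mirror the mdin keywords.

```python
# Values copied from the mdin above; dt is in picoseconds.
nstlim = 2_500_000
dt = 0.002
ntpr = 25_000    # steps between energy records in mdout
ntwx = 25_000    # steps between trajectory frames (ioutfm=1: NetCDF)
ntwr = 250_000   # steps between restart-file writes

sim_ns = nstlim * dt / 1000.0           # total simulated time in ns
n_energy_records = nstlim // ntpr       # NSTEP blocks printed to mdout
n_frames = nstlim // ntwx               # trajectory frames written
n_restarts = nstlim // ntwr             # periodic restart writes

print(f"{sim_ns:g} ns, {n_energy_records} energy records, "
      f"{n_frames} frames, {n_restarts} restart writes")
# 5 ns, 100 energy records, 100 frames, 10 restart writes
```

So each card was asked for 5 ns of JAC NPT, which is what makes JAC a cheap way to extend the test.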
>> >
>> >
>> > The card in PCIe slot 0 never failed. The other 3 cards (named after the
>> > PCIe slot they occupied) always failed in the following order:
>> >
>> > card1 = within the first 10-20 mins
>> > card2 = shortly after card1
>> > card3 = takes a long time to fail; almost gets to the end and sometimes
>> > completes the run
>> >
>> > The failure was always a skinnb-type error.
>> >
>> > Obviously it was quite suspicious that only card0 in the primary PCIe slot
>> > passed, and I thought it might have something to do with the switching
>> > function of the PLX chip interfering with things. So I took all the cards
>> > out and tested them individually in PCIe slot 0. All of them failed with
>> > the skinnb error. Additionally, every step in the mdout file is populated
>> > with:
>> >
>> > ########################################
>> > check COM velocity, temp: 0.000028 0.00(Removed)
>> > check COM velocity, temp: 0.000037 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> >
>> > NSTEP =    25000   TIME(PS) =      56.000  TEMP(K) =   300.60  PRESS =  -254.5
>> > Etot   =    -58129.2013  EKtot   =     14450.1387  EPtot      =    -72579.3399
>> > BOND   =       473.9411  ANGLE   =      1296.5580  DIHED      =       977.4736
>> > 1-4 NB =       551.6041  1-4 EEL =      6656.6898  VDWAALS    =      8413.2767
>> > EELEC  =    -90948.8832  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =      6304.1052  VIRIAL  =      7593.8283  VOLUME     =    234670.0662
>> >                                                    Density    =         1.0226
>> >
>> > ------------------------------------------------------------------------------
>> >
>> > check COM velocity, temp: 0.000064 0.00(Removed)
>> > check COM velocity, temp: 0.000035 0.00(Removed)
>> > check COM velocity, temp: 0.000020 0.00(Removed)
>> > check COM velocity, temp: 0.000031 0.00(Removed)
>> > check COM velocity, temp: 0.000045 0.00(Removed)
>> > check COM velocity, temp: 0.000023 0.00(Removed)
>> > check COM velocity, temp: 0.000010 0.00(Removed)
>> > check COM velocity, temp: 0.000022 0.00(Removed)
>> > check COM velocity, temp: 0.000044 0.00(Removed)
>> > check COM velocity, temp: 0.000047 0.00(Removed)
>> > check COM velocity, temp: 0.000014 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000037 0.00(Removed)
>> > check COM velocity, temp: 0.000017 0.00(Removed)
>> > check COM velocity, temp: 0.000040 0.00(Removed)
>> > check COM velocity, temp: 0.000028 0.00(Removed)
>> > check COM velocity, temp: 0.000032 0.00(Removed)
>> > check COM velocity, temp: 0.000014 0.00(Removed)
>> > check COM velocity, temp: 0.000030 0.00(Removed)
>> > check COM velocity, temp: 0.000042 0.00(Removed)
>> > check COM velocity, temp: 0.000036 0.00(Removed)
>> > check COM velocity, temp: 0.000027 0.00(Removed)
>> > check COM velocity, temp: 0.000040 0.00(Removed)
>> > check COM velocity, temp: 0.000026 0.00(Removed)
>> > check COM velocity, temp: 0.000053 0.00(Removed)
>> >
>> > NSTEP =    50000   TIME(PS) =     106.000  TEMP(K) =   299.91  PRESS =    60.9
>> > Etot   =    -58186.5909  EKtot   =     14416.8516  EPtot      =    -72603.4424
>> > BOND   =       468.9608  ANGLE   =      1272.6458  DIHED      =      1000.8139
>> > 1-4 NB =       554.1092  1-4 EEL =      6681.9525  VDWAALS    =      8584.4464
>> > EELEC  =    -91166.3710  EHBOND  =         0.0000  RESTRAINT  =         0.0000
>> > EKCMT  =      6287.7987  VIRIAL  =      5978.9791  VOLUME     =    234711.1150
>> >                                                    Density    =         1.0224
>> >
>> > ------------------------------------------------------------------------------
>> >
>> > check COM velocity, temp: 0.000048 0.00(Removed)
>> > check COM velocity, temp: 0.000044 0.00(Removed)
>> > check COM velocity, temp: 0.000034 0.00(Removed)
>> > check COM velocity, temp: 0.000018 0.00(Removed)
>> > ########################################
>> >
>> > I have not seen this before and am not sure whether it is normal. If
>> > someone could clarify, it would be appreciated.
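One way to read the excerpt above: between the NSTEP 25000 and NSTEP 50000 records there are 25 "check COM velocity" lines, i.e. one every 1000 steps. That interval is consistent with periodic centre-of-mass motion removal controlled by the mdin keyword `nscm`, whose default in the Amber manual is 1000 steps; treat that link as an assumption rather than a diagnosis. The hypothetical helper below just counts those lines per energy record so the interval can be checked on any mdout. The sample text mimics the excerpt's layout, with the COM values simplified.

```python
def com_lines_per_record(mdout_text):
    """Return (nstep, count) pairs, where count is the number of
    'check COM velocity' lines seen since the previous NSTEP record."""
    counts = []
    pending = 0
    for line in mdout_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("check COM velocity"):
            pending += 1
        elif stripped.startswith("NSTEP"):
            nstep = int(stripped.split("=")[1].split()[0])
            counts.append((nstep, pending))
            pending = 0
    return counts


# Same shape as the excerpt: 4 COM lines, a record, 25 COM lines, a record.
excerpt = "\n".join(
    ["check COM velocity, temp: 0.000028 0.00(Removed)"] * 4
    + ["NSTEP =    25000   TIME(PS) =      56.000  TEMP(K) =   300.60  PRESS =  -254.5"]
    + ["check COM velocity, temp: 0.000064 0.00(Removed)"] * 25
    + ["NSTEP =    50000   TIME(PS) =     106.000  TEMP(K) =   299.91  PRESS =    60.9"]
)

counts = com_lines_per_record(excerpt)
ntpr = 25000
# The first block is truncated in the paste, so use the complete 25000-step block.
nstep, n_com = counts[1]
print(counts, ntpr // n_com)  # [(25000, 4), (50000, 25)] 1000
```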
>> >
>> > 4) I put card0 back into the box on its own and ran 2x 100 ns of production
>> > simulation of HIV protease (NPT) with no issues, so I am pretty convinced
>> > that this card is good.
>> >
>> >
>> > With my 4x 780 setup, 3 of the cards failed with errors and gave
>> > non-deterministic NPT results when they did not crash, which seems very bad
>> > luck considering Ross tested a 4-GPU combination with no failures at all. I
>> > thought this might have something to do with the particular batches of the
>> > card produced at various times, so I checked all the serial numbers printed
>> > on the hardware. The serial numbers etc. were all the same, but what was
>> > quite weird was that only the card that was working had a distinctive
>> > stamp:
>> >
>> > "T7-E5"
>> >
>> > Probably it's nothing, but it would be interesting to know whether any
>> > other owners of working Zotac 780s have this stamp or not.
>> >
>> > Going to RMA the 3 failing Zotacs now and switch to 680s.
>> >
>> >
>> >
>> >
>> > On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
>> >
>> > > ¡Ai caramba! :/
>> > >
>> > > it looks like 3 of the cards are consistently failing with skinnb errors
>> > > on.....
>> > >
>> > >
>> > > you guessed it:
>> > >
>> > > JAC NPT
>> > >
>> > > Have been running tests this weekend. Will post my findings later today
>> > > or tomorrow.
>> > >
>> > >
>> > > On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
>> > >
>> > >> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
>> > >>
>> > >> All the tests were reproducible across 2x repeats.
>> > >>
>> > >> Going to get a couple more today.
>> > >>
>> > >> br,
>> > >> g
>> > >>
>> > >>
>> > >> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
>> > >>
>> > >>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
>> > >>> 'em tomorrow. If they are good, will order another 2.
>> > >>>
>> > >>> Thanks again for testing them.
>> > >>>
>> > >>>
>> > >>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
>> > >>>
>> > >>>> The GTX780s do not appear to be broken - we are just being cautious
>> > >>>> right now.
>> > >>>>
>> > >>>> The Titans are broken for everyone right now - well, broken for anyone
>> > >>>> who actually hits whatever they are broken with, which is still being
>> > >>>> investigated. But certainly for anyone who uses cuFFT the Titans appear
>> > >>>> to be broken right now.
>> > >>>>
>> > >>>> All the best
>> > >>>> Ross
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
>> > >>>>
>> > >>>> >Are they "broken" only in terms of AMBER? Or could this be classed as a
>> > >>>> >general hardware fault pertaining to all applications that use the
>> > >>>> >card?
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> >
>> > >>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
>> > >>>> >
>> > >>>> >> It's not really a question of how it's programmed; it's a question of
>> > >>>> >> manufacturing. One picks 12 out of 15 processor cores on the chip
>> > >>>> >> itself to make a GTX 780, as opposed to picking 14 out of 15 processor
>> > >>>> >> cores to make a GTX Titan. In the former case there are 455 ways to do
>> > >>>> >> so, and in the latter, 15.
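The 455 and 15 figures are binomial coefficients: counting which of the 15 SMX units are disabled (3 for a GTX 780, 1 for a Titan). A short check, using only the numbers given in the thread, with a brute-force enumeration as a cross-check. It counts possible configurations only and says nothing about which of them are faulty.

```python
from itertools import combinations
from math import comb

SMX_TOTAL = 15  # SMX units on a full GK110 die, per the thread

# Count the ways to choose the disabled SMXs:
# 1 for a GTX Titan (14 active), 3 for a GTX 780 (12 active).
titan_ways = comb(SMX_TOTAL, SMX_TOTAL - 14)
gtx780_ways = comb(SMX_TOTAL, SMX_TOTAL - 12)

# Brute-force cross-check: enumerate every set of 3 disabled SMX indices.
disabled_sets = list(combinations(range(SMX_TOTAL), 3))
assert len(disabled_sets) == gtx780_ways

# Choosing the 12 enabled SMXs gives the same count as choosing the 3 disabled.
assert comb(SMX_TOTAL, 12) == gtx780_ways

print(titan_ways, gtx780_ways)  # 15 455
```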
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >>
>> > >>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
>> > >>>> >>
>> > >>>> >> > Thanks very much for the quick information, guys! It's much
>> > >>>> >> > appreciated.
>> > >>>> >> >
>> > >>>> >> > I'm not that up on the manner in which these cards are programmed,
>> > >>>> >> > so I am a little confused by your explanation, Scott. Could you
>> > >>>> >> > please clarify it for me?
>> > >>>> >> >
>> > >>>> >> > br,
>> > >>>> >> > g
>> > >>>> >> >
>> > >>>> >> >
>> > >>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
>> > >>>> >> >
>> > >>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan, one of
>> > >>>> >> > > them is disabled. There are 15 (15 choose 1) ways to do this. All
>> > >>>> >> > > of them seem to be broken.
>> > >>>> >> > >
>> > >>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are
>> > >>>> >> > > 455 (15 choose 3) ways to make one. I'm a little nervous that some
>> > >>>> >> > > of those configurations may be broken, so the best thing to do is
>> > >>>> >> > > to test whether they exhibit deterministic behavior upon acquiring
>> > >>>> >> > > them, and if they don't, RMA them as defective.
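The determinism test Scott describes amounts to running the same input twice on the same card (fixed `ig` seed) and checking that the mdout energy records match exactly. The sketch below is a hypothetical helper, not part of Amber: it assumes records start at an `NSTEP` line, end at a blank or dashed line, and that the final averages section (marked "A V E R A G E S") should be ignored. The sample text and the last-digit drift are made up for illustration.

```python
def energy_records(mdout_text):
    """Return the energy records (lists of stripped lines, from each NSTEP
    line to the next blank/dashed line), stopping at the averages section."""
    records, current = [], None
    for line in mdout_text.splitlines():
        stripped = line.strip()
        if "A V E R A G E S" in stripped:
            break                      # ignore the end-of-run averages
        if stripped.startswith("NSTEP"):
            current = [stripped]
            records.append(current)
        elif current is not None and stripped and not stripped.startswith("---"):
            current.append(stripped)
        else:
            current = None             # blank line or separator ends a record
    return records


def first_divergence(text_a, text_b):
    """Index of the first energy record that differs between two runs,
    or None if every record matches exactly."""
    ra, rb = energy_records(text_a), energy_records(text_b)
    for i, (a, b) in enumerate(zip(ra, rb)):
        if a != b:
            return i
    if len(ra) != len(rb):
        return min(len(ra), len(rb))   # one run stopped early, e.g. a crash
    return None


run_a = (" NSTEP =    25000   TIME(PS) =      56.000\n Etot   =    -58129.2013\n\n"
         " NSTEP =    50000   TIME(PS) =     106.000\n Etot   =    -58186.5909\n")
run_b = run_a.replace("-58186.5909", "-58186.5911")  # hypothetical drift
print(first_divergence(run_a, run_a), first_divergence(run_a, run_b))  # None 1
```

Exact string comparison is deliberate here: a run that is deterministic should reproduce every printed digit, so any difference at all is the signal.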
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > >
>> > >>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>> > >>>> >> > >
>> > >>>> >> > > > Hi All,
>> > >>>> >> > > >
>> > >>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing, neither
>> > >>>> >> > > > Scott nor I have been able to break the GTX780s. This is in a
>> > >>>> >> > > > 4x GTX780 Exxact system, although at present we have only tested
>> > >>>> >> > > > multiple single-GPU runs using all 4 GPUs at once, i.e. pmemd.cuda
>> > >>>> >> > > > (NOT pmemd.cuda.MPI). I will be testing pmemd.cuda.MPI shortly, but
>> > >>>> >> > > > I don't see why it wouldn't work given that single-GPU runs are
>> > >>>> >> > > > working fine.
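Running "multiple single-GPU runs using all 4 GPUs at once" (and, as in ET's later test, staggering the start times to spread out disk I/O) can be scripted by launching one independent `pmemd.cuda` job per GPU, each pinned with the `CUDA_VISIBLE_DEVICES` environment variable. The `-O -i -p -c -o -r -x` flags are standard pmemd options; the input file names, the per-GPU output names, and the 60 s stagger are hypothetical placeholders. `dry_run=True` only builds the commands, so nothing is executed.

```python
import os
import subprocess
import time

# Hypothetical input file names; substitute your own.
MDIN, PRMTOP, INPCRD = "mdin", "prmtop", "inpcrd"


def pmemd_command(gpu):
    """Single-GPU pmemd.cuda command line with per-GPU output file names."""
    return ["pmemd.cuda", "-O", "-i", MDIN, "-p", PRMTOP, "-c", INPCRD,
            "-o", f"mdout.gpu{gpu}", "-r", f"restrt.gpu{gpu}",
            "-x", f"mdcrd.gpu{gpu}"]


def launch_all(gpus, stagger_s=60.0, dry_run=True):
    """Start one independent run per GPU, pinned via CUDA_VISIBLE_DEVICES,
    waiting stagger_s seconds between starts. With dry_run=True nothing is
    executed. Returns (gpu, command, process-or-None) tuples."""
    runs = []
    for n, gpu in enumerate(gpus):
        cmd = pmemd_command(gpu)
        proc = None
        if not dry_run:
            if n:
                time.sleep(stagger_s)   # stagger start-up disk I/O
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
            proc = subprocess.Popen(cmd, env=env)
        runs.append((gpu, cmd, proc))
    return runs


if __name__ == "__main__":
    for gpu, cmd, _ in launch_all([0, 1, 2, 3]):
        print(f"CUDA_VISIBLE_DEVICES={gpu}", " ".join(cmd))
```

With `dry_run=False`, call `wait()` on each returned process to block until all runs finish. Note that the index given to `CUDA_VISIBLE_DEVICES` is CUDA's device number, which is not guaranteed to match the physical PCIe slot order.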
>> > >>>> >> > > >
>> > >>>> >> > > > Key, though, is that there are multiple ways to build GTX780s, and
>> > >>>> >> > > > for now we have only tested one specific model, which is as follows:
>> > >>>> >> > > >
>> > >>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > > Until we have an opportunity to test GTX780s from different vendors
>> > >>>> >> > > > and OC versions, the advice is to stick with the above model if you
>> > >>>> >> > > > can.
>> > >>>> >> > > >
>> > >>>> >> > > > All the best
>> > >>>> >> > > > Ross
>> > >>>> >> > > >
>> > >>>> >> > > > /\
>> > >>>> >> > > > \/
>> > >>>> >> > > > |\oss Walker
>> > >>>> >> > > >
>> > >>>> >> > > > ---------------------------------------------------------
>> > >>>> >> > > > | Associate Research Professor |
>> > >>>> >> > > > | San Diego Supercomputer Center |
>> > >>>> >> > > > | Adjunct Associate Professor |
>> > >>>> >> > > > | Dept. of Chemistry and Biochemistry |
>> > >>>> >> > > > | University of California San Diego |
>> > >>>> >> > > > | NVIDIA Fellow |
>> > >>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
>> > >>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> > >>>> >> > > > ---------------------------------------------------------
>> > >>>> >> > > >
>> > >>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
>> > >>>> >> > > > may not be read every day, and should not be used for urgent or
>> > >>>> >> > > > sensitive issues.
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > >
>> > >>>> >> > > > _______________________________________________
>> > >>>> >> > > > AMBER mailing list
>> > >>>> >> > > > AMBER.ambermd.org
>> > >>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
>> > >>>> >> > > >
>> > >>>> >> > >
>> > >>>> >> >
>> > >>>> >>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>>
>> > >>
>> > >
>> >
>>
>
>
Received on Thu Jul 11 2013 - 19:00:04 PDT