1) Initially, I ran the full set of AMBER benchmarks at the standard settings
(100k steps).
All the cards passed without issue.
2) Increased nstlim to 200k steps.
One card crashed outright with the error: Nonbond cells need to be
recalculated, restart simulation from previous checkpoint
with a higher value for skinnb.
Reproducibility errors occurred on two other cards, in JAC NPT and Cellulose
NPT.
3) At this point I decided to concentrate on JAC NPT, as it is the largest
source of errors and nstlim can be extended without much of a time
penalty.
So I extended nstlim to 2500000 and ran all cards simultaneously, albeit
with staggered start times to spread out the disk I/O.
mdin:
ntx=5, irest=1,
ntc=2, ntf=2,
nstlim=2500000,
ntpr=25000, ntwx=25000,
ntwr=250000,
dt=0.002, cut=8.,
ntt=1, tautp=10.0,
temp0=300.0,
ntb=2, ntp=1, taup=10.0,
ioutfm=1,ig=43689,
The card in PCIe slot 0 never failed. The other 3 cards (named after the
PCIe slot they occupied) always failed in the following order:
card1 = within the first 10-20 mins
card2 = shortly after card1
card3 = takes a long time to fail; it almost gets to the end and sometimes
makes it
The failure was always a skinnb-type error.
It was obviously quite suspicious that only card0 in the primary PCIe slot
passed, and I thought it might have something to do with the switching
function on the PLX chip interfering with things. So I took all the cards
out and tested them individually in PCIe slot 0. All of them failed with the
skinnb error. Additionally, every step in the mdout file is populated with:
########################################
check COM velocity, temp: 0.000028 0.00(Removed)
check COM velocity, temp: 0.000037 0.00(Removed)
check COM velocity, temp: 0.000032 0.00(Removed)
check COM velocity, temp: 0.000032 0.00(Removed)
NSTEP = 25000 TIME(PS) = 56.000 TEMP(K) = 300.60 PRESS = -254.5
Etot = -58129.2013 EKtot = 14450.1387 EPtot = -72579.3399
BOND = 473.9411 ANGLE = 1296.5580 DIHED = 977.4736
1-4 NB = 551.6041 1-4 EEL = 6656.6898 VDWAALS = 8413.2767
EELEC = -90948.8832 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 6304.1052 VIRIAL = 7593.8283 VOLUME = 234670.0662
Density = 1.0226
------------------------------------------------------------------------------
check COM velocity, temp: 0.000064 0.00(Removed)
check COM velocity, temp: 0.000035 0.00(Removed)
check COM velocity, temp: 0.000020 0.00(Removed)
check COM velocity, temp: 0.000031 0.00(Removed)
check COM velocity, temp: 0.000045 0.00(Removed)
check COM velocity, temp: 0.000023 0.00(Removed)
check COM velocity, temp: 0.000010 0.00(Removed)
check COM velocity, temp: 0.000022 0.00(Removed)
check COM velocity, temp: 0.000044 0.00(Removed)
check COM velocity, temp: 0.000047 0.00(Removed)
check COM velocity, temp: 0.000014 0.00(Removed)
check COM velocity, temp: 0.000032 0.00(Removed)
check COM velocity, temp: 0.000037 0.00(Removed)
check COM velocity, temp: 0.000017 0.00(Removed)
check COM velocity, temp: 0.000040 0.00(Removed)
check COM velocity, temp: 0.000028 0.00(Removed)
check COM velocity, temp: 0.000032 0.00(Removed)
check COM velocity, temp: 0.000014 0.00(Removed)
check COM velocity, temp: 0.000030 0.00(Removed)
check COM velocity, temp: 0.000042 0.00(Removed)
check COM velocity, temp: 0.000036 0.00(Removed)
check COM velocity, temp: 0.000027 0.00(Removed)
check COM velocity, temp: 0.000040 0.00(Removed)
check COM velocity, temp: 0.000026 0.00(Removed)
check COM velocity, temp: 0.000053 0.00(Removed)
NSTEP = 50000 TIME(PS) = 106.000 TEMP(K) = 299.91 PRESS = 60.9
Etot = -58186.5909 EKtot = 14416.8516 EPtot = -72603.4424
BOND = 468.9608 ANGLE = 1272.6458 DIHED = 1000.8139
1-4 NB = 554.1092 1-4 EEL = 6681.9525 VDWAALS = 8584.4464
EELEC = -91166.3710 EHBOND = 0.0000 RESTRAINT = 0.0000
EKCMT = 6287.7987 VIRIAL = 5978.9791 VOLUME = 234711.1150
Density = 1.0224
------------------------------------------------------------------------------
check COM velocity, temp: 0.000048 0.00(Removed)
check COM velocity, temp: 0.000044 0.00(Removed)
check COM velocity, temp: 0.000034 0.00(Removed)
check COM velocity, temp: 0.000018 0.00(Removed)
########################################
I have not seen this before and am not sure whether it is normal. If
someone could clarify, it would be appreciated.
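For anyone who wants to check how often the "check COM velocity" lines appear relative to the energy records, here is a minimal Python sketch. It is not part of AMBER; the function name summarize_mdout and the regular expressions are my own assumptions based only on the excerpt above. It counts the diagnostic lines seen before each NSTEP/Etot record, so the count can be compared against ntpr and any COM-removal interval (nscm) set in the mdin.

```python
import re

# Patterns are assumptions derived from the mdout excerpt in this post.
COM_LINE = re.compile(r"^\s*check COM velocity, temp:")
NSTEP_LINE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT = re.compile(r"Etot\s*=\s*(-?\d+\.\d+)")


def summarize_mdout(text):
    """Return ([(nstep, etot, com_lines_before_record), ...], trailing_com_lines)."""
    records = []
    com_since_last = 0
    pending_step = None
    for line in text.splitlines():
        if COM_LINE.match(line):
            com_since_last += 1
            continue
        step = NSTEP_LINE.search(line)
        if step:
            pending_step = int(step.group(1))
        etot = ETOT.search(line)
        if etot and pending_step is not None:
            # Close the record once both NSTEP and Etot have been seen.
            records.append((pending_step, float(etot.group(1)), com_since_last))
            com_since_last = 0
            pending_step = None
    return records, com_since_last
```

To use it, read the mdout file into a string (for example with open("mdout").read()) and pass that to summarize_mdout. A constant count between records would suggest the lines are printed at a fixed step interval rather than at every step.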
4) I put card0 back into the box on its own and ran 2x100ns of production
simulation of HIV-protease NPT with no issues. So I am pretty convinced that
this card is good.
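For reference, the simultaneous runs with staggered start times described in step 3 could be scripted roughly as follows. This is only a sketch under assumptions: the per-card directory names (card0, card1, ...), the input file names, and the 60 s delay are hypothetical, and the CUDA device index given to CUDA_VISIBLE_DEVICES does not necessarily match the physical PCIe slot order.

```python
import os
import subprocess
import time


def build_jobs(gpu_ids, workdir_template="card{gpu}"):
    """Build one (workdir, command, environment) tuple per GPU.

    The directory layout and file names are assumptions for illustration.
    """
    jobs = []
    for gpu in gpu_ids:
        # Restrict each pmemd.cuda process to a single device.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        cmd = ["pmemd.cuda", "-O",
               "-i", "mdin", "-o", "mdout",
               "-p", "prmtop", "-c", "inpcrd",
               "-r", "restrt", "-x", "mdcrd"]
        jobs.append((workdir_template.format(gpu=gpu), cmd, env))
    return jobs


def launch_staggered(jobs, delay_s=60.0):
    """Start the jobs one after another, pausing between launches."""
    procs = []
    for index, (workdir, cmd, env) in enumerate(jobs):
        if index > 0:
            time.sleep(delay_s)  # offset start times to spread disk I/O
        procs.append(subprocess.Popen(cmd, cwd=workdir, env=env))
    return procs
```

Calling launch_staggered(build_jobs([0, 1, 2, 3])) would start four independent single-GPU runs; each run writes into its own directory so the output files do not collide.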
With my 4x780 setup, 3 of the cards failed with errors and had NPT
determinism issues when they did not crash, which seems very bad luck
considering Ross tested a 4-GPU combo with no failures at all. I thought
that this might have something to do with the particular batches of the card
that have been produced at various times, so I checked all the serial
numbers printed on the hardware. The serial numbers, etc. were all the same,
but what was quite weird was that only the working card had a
distinctive stamp:
"T7-E5"
It's probably nothing, but it would be interesting to know whether any
other owners of working Zotac 780s have this stamp.
I am going to RMA the 3 Zotacs now and switch to 680s.
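The reproducibility check referred to above (two runs from identical inputs should print identical energies) can be automated along these lines. This is a minimal sketch, not an AMBER tool: the function names are hypothetical, and it assumes each energy record starts at a line beginning with "NSTEP =" and ends at a line of dashes, as in the mdout excerpt. A record left unterminated by a crash is ignored.

```python
def energy_blocks(text):
    """Collect whitespace-normalized energy records from mdout text."""
    blocks = []
    current = None
    for line in text.splitlines():
        stripped = " ".join(line.split())
        if stripped.startswith("NSTEP ="):
            current = [stripped]          # start of a new record
        elif current is not None:
            if stripped.startswith("-----"):
                blocks.append(current)    # dashed line closes the record
                current = None
            elif stripped:
                current.append(stripped)
    return blocks


def first_divergence(text_a, text_b):
    """Return the index of the first differing record, or None if all match."""
    a = energy_blocks(text_a)
    b = energy_blocks(text_b)
    for index, (block_a, block_b) in enumerate(zip(a, b)):
        if block_a != block_b:
            return index
    if len(a) != len(b):
        # One run stopped early, e.g. after a crash.
        return min(len(a), len(b))
    return None
```

Comparing the printed text exactly, rather than within a tolerance, is deliberate here: the test is for bit-for-bit repeatability of the output, so any difference in a printed digit counts as a divergence.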
On 8 July 2013 08:29, ET <sketchfoot.gmail.com> wrote:
> !ai caramba! :/
>
> it looks like 3 of the cards are consistently failing with skinnb errors
> on.....
>
>
> you guessed it:
>
> JAC NPT
>
> Have been running tests this weekend. Will post my findings later today or
> tomorrow.
>
>
> On 3 July 2013 12:58, ET <sketchfoot.gmail.com> wrote:
>
>> FYI: Just got 2x Zotac 780s and ran the benchmark tests.
>>
>> All the tests were reproducible across 2x repeats.
>>
>> Going to get a couple of more today.
>>
>> br,
>> g
>>
>>
>> On 27 June 2013 21:43, ET <sketchfoot.gmail.com> wrote:
>>
>>> no worries. :) Already RMA'd 2x Titans and bought 2x Zotacs. Will check
>>> 'em tomorrow. If they are good will order another 2.
>>>
>>> Thanks again for testing them.
>>>
>>>
>>> On 27 June 2013 19:43, Ross Walker <ross.rosswalker.co.uk> wrote:
>>>
>>>> The GTX780s do not appear to be broken - we are just being cautious
>>>> right now.
>>>>
>>>> The Titans are broken for everyone right now - well, broken for anyone
>>>> who actually hits what they are broken with - which is still being
>>>> investigated. But certainly for anyone who uses cuFFT the Titans appear
>>>> to be broken right now.
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>
>>>>
>>>> On 6/27/13 11:20 AM, "ET" <sketchfoot.gmail.com> wrote:
>>>>
>>>> >Are they "broken" only in terms of AMBER? Or could this be classed as a
>>>> >general hardware fault pertaining to all applications that use the
>>>> >card?
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >On 27 June 2013 18:50, Scott Le Grand <varelse2005.gmail.com> wrote:
>>>> >
>>>> >> It's not really a question of how it's programmed, it's a question of
>>>> >> manufacturing. One picks 12 out of 15 processor cores on the chip
>>>> >> itself to make a GTX 780 as opposed to picking 14 out of 15 processor
>>>> >> cores to make a GTX Titan. In the former, there are 455 ways to do so
>>>> >> and in the latter, 15.
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Jun 26, 2013 at 7:13 PM, ET <sketchfoot.gmail.com> wrote:
>>>> >>
>>>> >> > Thanks very much for the quick information guys! It's much
>>>> >> > appreciated.
>>>> >> >
>>>> >> > I'm not that up on the manner in which these cards are programmed,
>>>> >> > so am a little confused by your explanation Scott. Could you please
>>>> >> > clarify it for me?
>>>> >> >
>>>> >> > br,
>>>> >> > g
>>>> >> >
>>>> >> >
>>>> >> > On 27 June 2013 01:47, Scott Le Grand <varelse2005.gmail.com> wrote:
>>>> >> >
>>>> >> > > To clarify, there are 15 SMXs in a GK110 GPU. For GTX Titan, one of
>>>> >> > > them is disabled. There are 15 (15 choose 1) ways to do this. All
>>>> >> > > of them seem to be broken.
>>>> >> > >
>>>> >> > > There are 12 out of 15 active SMXs in GTX 780. That means there are
>>>> >> > > 455 (15 choose 3) ways to make one. I'm a little nervous that some
>>>> >> > > of those configurations may be broken, so the best thing to do is
>>>> >> > > to test if they exhibit deterministic behavior upon acquiring them,
>>>> >> > > and if they don't, RMA them as defective.
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > > On Wed, Jun 26, 2013 at 4:31 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>>>> >> > >
>>>> >> > > > Hi All,
>>>> >> > > >
>>>> >> > > > Ok, good news on the GTX780 front. After 4 days of testing neither
>>>> >> > > > Scott nor myself have been able to break the GTX780s. This is in a
>>>> >> > > > 4 x GTX780 Exxact system although at present we have only tested
>>>> >> > > > multiple single GPU runs using all 4 GPUs at once - i.e. pmemd.cuda
>>>> >> > > > (NOT pmemd.cuda.MPI) - I will be testing pmemd.cuda.MPI shortly but
>>>> >> > > > I don't see why this wouldn't work given single GPU is working
>>>> >> > > > fine.
>>>> >> > > >
>>>> >> > > > Key though is that there are multiple ways to build GTX780s, and
>>>> >> > > > for now we have only tested one specific model which is as follows:
>>>> >> > > >
>>>> >> > > > http://tinyurl.com/prxlwy6 Zotac GTX780 ZT-70201-10P
>>>> >> > > >
>>>> >> > > >
>>>> >> > > > Until we have an opportunity to test different vendor GTX780s and
>>>> >> > > > OC versions the advice is to stick with the above model if you can.
>>>> >> > > >
>>>> >> > > > All the best
>>>> >> > > > Ross
>>>> >> > > >
>>>> >> > > > /\
>>>> >> > > > \/
>>>> >> > > > |\oss Walker
>>>> >> > > >
>>>> >> > > > ---------------------------------------------------------
>>>> >> > > > | Associate Research Professor |
>>>> >> > > > | San Diego Supercomputer Center |
>>>> >> > > > | Adjunct Associate Professor |
>>>> >> > > > | Dept. of Chemistry and Biochemistry |
>>>> >> > > > | University of California San Diego |
>>>> >> > > > | NVIDIA Fellow |
>>>> >> > > > | http://www.rosswalker.co.uk | http://www.wmd-lab.org |
>>>> >> > > > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>>>> >> > > > ---------------------------------------------------------
>>>> >> > > >
>>>> >> > > > Note: Electronic Mail is not secure, has no guarantee of delivery,
>>>> >> > > > may not be read every day, and should not be used for urgent or
>>>> >> > > > sensitive issues.
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > >
>>>> >> > > > _______________________________________________
>>>> >> > > > AMBER mailing list
>>>> >> > > > AMBER.ambermd.org
>>>> >> > > > http://lists.ambermd.org/mailman/listinfo/amber
>>>> >> > > >
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
Received on Thu Jul 11 2013 - 09:30:05 PDT