Re: [AMBER] cudaMemcpy GpuBuffer::Download failed unspecified launch failure

From: filip fratev <filipfratev.yahoo.com>
Date: Thu, 30 Jan 2014 08:32:55 -0800 (PST)

Hi Ross,

Thank you for your advice to RMA the card!
I received the new card last week and have run some tests over the past few days. The new card is much more stable and the problems are resolved (no crashes at all). I can't be 100% sure and need to run more tests, but this card really is much more stable. So it seems the GTX 780 Ti SC is not a bad GPU after all :)

Regards,
Filip






On Monday, January 13, 2014 11:28 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
 
Hi Filip,

Not necessarily. I have seen GPUs be flaky - a lot of the time they give
wrong answers when they run the reproducibility test I put together, and
on longer runs they randomly crash with skinnb errors, cuda upload errors
etc. In most cases things work fine once the GPU is replaced. E.g. with the
AMBER certified machines from Exxact, part of the process involves burning
in each GPU on AMBER sims for 24 hours before shipping to make sure the
results are reproducible. I would estimate we see somewhere around 2 to 3%
of the GPUs being bad out of the box. Anecdotally, the frequency of
problematic GPUs seems to decrease with time (i.e. as a design and the
corresponding silicon ages and thus the yield improves).

My advice would be to just RMA it and you'll hopefully get a good 780 Ti
GPU as a replacement.

All the best
Ross
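
The reproducibility check described in this thread (run the same input
several times on one GPU and look for energies that do not match) could be
sketched roughly as follows. This is only an illustration, not the
validation suite Ross refers to: the function names are hypothetical, and
the regular expression assumes that each run's output contains energy
records of the form `Etot   =   -12345.6789`.

```python
import re

# Assumption: energy records in the output look like "Etot   =   -12345.6789".
ETOT_RE = re.compile(r"\bEtot\s*=\s*(-?\d+\.\d+)")


def extract_etot(output_text):
    """Return the printed Etot values from one run's output, as strings."""
    return ETOT_RE.findall(output_text)


def first_mismatch(runs):
    """Compare every run against the first one.

    `runs` is a list of output texts from repeated runs of the same input.
    Returns (run_index, record_index) for the first difference found,
    or None if all runs printed identical energies.
    """
    reference = extract_etot(runs[0])
    for i, text in enumerate(runs[1:], start=1):
        energies = extract_etot(text)
        # Compare the printed digits exactly, record by record.
        for j, (ref, val) in enumerate(zip(reference, energies)):
            if ref != val:
                return (i, j)
        # A run that stopped early (e.g. a crash) has fewer records.
        if len(energies) != len(reference):
            return (i, min(len(energies), len(reference)))
    return None
```

A result of None means all runs agree; any other result, including a run
cut short by a crash, is the kind of mismatch the thread treats as a sign
of a bad card.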


On 1/13/14, 1:18 PM, "filip fratev" <filipfratev.yahoo.com> wrote:

>Hi Ross,
>I ran 100 ns on a single protein (Estrogen Receptor alpha, 32,000
>atoms) without any crash. I suppose that even if I RMA the GPU the
>situation will be the same. This probably means that the GTX 780 Ti SC
>is clocked too high and is unstable, crashing on more "difficult"
>systems, or that some other issue is present ..
>
>Regards,
>Filip
>
>
>
>
>
>
>
>On Saturday, January 11, 2014 11:29 PM, filip fratev
><filipfratev.yahoo.com> wrote:
>
>Hi Ross again,
>>That said the fact that this used to run well and now doesn't suggests
>>to me something wrong with the GPU.
>
>
>In fact this is what worries me. Initially I simulated a very simple
>protein-protein complex, but now I have a somewhat more complicated test
>system (ligands, co-factor, structural waters, etc.). Right now I am
>running 100 ns on a single protein and will see what happens.
>
>
>Is there any possibility that the problem is general to the GTX 780 Ti,
>i.e. the full GK110? I have a bad feeling .. :)
>
>Regards,
>Filip
>   
>
>
>
>
>On Saturday, January 11, 2014 5:00 PM, filip fratev
><filipfratev.yahoo.com> wrote:
>
>Hi Ross,
>Many thanks for your comments!! What worries me is
>that the card passed all possible stability tests under Linux and
>Windows... Do you think that ACEMD could be a good additional test?
>
>If someone here could confirm that their GTX 780 Ti
>works without problems, that would be best.
>Thanks in advance!
>
>Regards,
>Filip
>
>
>
>On Friday, January 10, 2014 5:38 PM, Ross Walker <ross.rosswalker.co.uk>
>wrote:
>
>Hi Filip,
>
>Try running the tests I sent you a couple more times. Go in and increase
>nstlim about 4x before running it so it runs the test for longer and see
>if any of the energies don't match. As soon as you see a mismatch or a
>crash there it is indicative of a bad GPU.
>
>That said the fact that this used to run well and now doesn't suggests to
>me something wrong with the GPU. I'd quickly try updating to the 319.60
>driver if you haven't already but I'd be surprised if it helps.
>
>My suggestion would be to RMA it - I've had no trouble RMA'ing gear.
>Typically I RMA it to the shop I bought it from (Amazon, Fry's etc) rather
>than the manufacturer since that's generally easier. Just say that it
>crashes or locks up in use (you don't have to say it is for GPU computing)
>and that peripheral replacement with an identical model avoids the problem
>proving that it is the GPU at fault. They will then replace it. I've never
>heard of a manufacturer testing something like a GPU before sending a
>replacement - their turnover is way too high for that to be cost
>effective.
>
>Worst case if you bought it on a credit card most credit cards offer
>extended warranty or purchase protection and you can just replace it
>through that. American Express for example has a guarantee for
>refund/replacement of a faulty product if the original seller refuses to
>accept the return.
>
>Hope that helps - just my experiences in life, this does not constitute
>official advice blah blah blah and all that junk...
>
>All the best
>Ross
>
>
>
>On 1/10/14, 3:13 AM, "filip fratev" <filipfratev.yahoo.com> wrote:
>
>>Hi again,
>>After the first crash (after 15 ns of simulation time) I can't get more
>>than 1 ns.... Probably the best I can do is to change the BIOS to the
>>non-SC version and see whether I have the same problems at reference
>>clocks.
>>
>>
>>
>>
>>
>>On Friday, January 10, 2014 11:51 AM, filip fratev
>><filipfratev.yahoo.com> wrote:
>>
>>Hi Ross and all,
>>The problems with my GTX 780 Ti SC have continued and are a real
>>disaster. I tried a new system and it crashes very often (about every
>>1,000,000 steps; if I am lucky I can get 10-15 ns). ntf=1 seems to
>>improve the situation but it is not a general solution. There are no
>>problems with the GTX Titan on the same system. Should I test the card
>>in another PC? Should I downclock the GPU? Could this be an
>>Amber/Nvidia driver-related problem? I was also wondering whether I
>>should, and whether it is even possible to, ask EVGA for an RMA. They
>>could just say "this is a gaming card" :( Has anyone had similar
>>problems with a GTX 780 Ti?
>>
>>Regards,
>>Filip
>>
>>
>>
>>
>>On 12/21/13 11:23 AM, "Ross Walker" <ross.rosswalker.co.uk> wrote:
>>
>>>Hi Filip,
>>>
>>>This was always my worry with the Ti cards (they are clocked rather high)
>>>which is why I haven't put the numbers up yet on the AMBER website. Let me
>>>send you off list a validation suite to run that will test if the cards
>>>have issues or not.
>>>
>>>You have just 2 cards in a box yes (same for the Titan machine)?
>>>
>>>All the best
>>>Ross
>>>
>>>
>>>On 12/21/13 11:08 AM, "filip fratev" <filipfratev.yahoo.com> wrote:
>>>
>>>>Hi all,
>>>>Just to inform you that I observed two random crashes of the GTX 780 Ti
>>>>SC with the "cudaMemcpy GpuBuffer::Download failed unspecified launch
>>>>failure" error during the last 2 weeks. No problems with the same
>>>>system on the GTX Titan. Should I run a memory test on this GPU? What
>>>>might be the problem? Has anyone experienced a similar problem recently?
>>>>
>>>>
>>>>Regards,
>>>>Filip
>>>>_______________________________________________
>>>>AMBER mailing list
>>>>AMBER.ambermd.org
>>>>http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 30 2014 - 09:00:02 PST