Re: [AMBER] cuda launch time out error in Amber 11

From: Sasha Buzko <obuzko.ucla.edu>
Date: Fri, 27 Aug 2010 10:26:25 -0700

Sergio, Ross, Scott,
in my experience, this error seems to be a common issue on multiple
GTX480s, so it's not a single faulty card. I have 8 of these, and had
these errors with all of them. Power doesn't appear to be the problem
either - I've tested the 8-GPU Colfax server with only one GTX480
installed, and got the error (and the server can supply well over 2 kW
of power). I use GTX480s made by Asus (not overclocked).

Not sure if it's the ECC (just a thought), but clearly there is a
systemic issue with Fermi-based consumer-level cards. I haven't seen
this problem on a GTX280 that I had used previously. And as Ross said, a
GTX295 card worked fine as well.

Sasha





Ross Walker wrote:
> Hi Sergio,
>
>
>> Thanks for your comments. It appears that the launch error has not
>> shown up on any Tesla card so far. I wonder if anyone could run with
>> ECC turned off in a Tesla card to determine whether that is the issue.
>>
>
> It's doubtful it is ECC. See my other post to Sasha regarding running
> C2050's with ECC turned off to get better performance.
>
>
>> If it's ECC, then one could explain the behavior of different cards.
>> Here at SFSU there is another system using a GTX240 card that has not
>> seen the error for 18,000 atoms. My GTX470 has only seen the error on
>> systems larger than 60,000 atoms (within 1 ns), but not at 25,000 atoms
>> (for 20 ns, multiple times).
>>
>
> Have you got another GTX470 available that you could test on and see if it
> is reproducible on identical cards? It is possible, say, that your graphics
> memory is marginal in terms of reliability and perhaps that error lies in
> the high end of the address space. That would explain why smaller systems
> can run fine. This is pure speculation though. If you see the error on
> multiple cards then it is much more likely to be a bug in the code.
>
> I trust the card you have is NOT overclocked on the GPU core or the memory.
>
>
>> Sasha's GTX480 card sees it within a few
>> ns with 20,000 atoms. This variability seems to indicate hardware
>> issues in the card, or environmental factors that cause a memory error.
>> My system is running in an office with heavy concrete walls on 5 of six
>> sides, giving some shielding from cosmic rays that could cause a bit
>> error in memory. Sasha: is yours more exposed?
>>
>
> I still subscribe to Seymour Cray's quote regarding this: "Parity is for
> Farmers."
>
>
>> If it's not this, then hardware quality could be a factor. As Scott
>> mentioned, the quality control is not as strict for consumer cards as
>> it is for Tesla cards.
>>
>
> Indeed. The other issue might be if you have a card from someone like BFG
> who sell overclocked versions etc. I would generally trust the PNY cards
> more. If you can though you may just want to RMA the card and let them send
> you a new one and see if it has the same issues.
>
>
>> I verify my trajectories by doing ensemble
>> average hydrodynamic computations to compare with experiment for the
>> proteins that I study. I have yet to verify the "good" trajectories
>> produced with cuda. I have computed several with standard
>> multiprocessor machines to make the comparison. If ECC is an issue,
>> the trajectory quality may not be good either, even if it finishes. I
>>
>
> I would be extremely surprised if any hardware failure or memory error gave
> you a trajectory that otherwise looked good. I think if you have anything
> dodgy with hardware etc you will see an unexplained blow up, shake error or
> other major issue.
>
>
>> will report later on that. If this is the case, then this should put a
>> nail in the coffin on the use of consumer cards for doing science.
>>
>
> Before you do that though either 1) RMA the card and/or 2) Get yourself a
> decent 1KW power supply.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>
>
Received on Fri Aug 27 2010 - 11:00:03 PDT