Re: [AMBER] cuda launch time out error in Amber 11

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 27 Aug 2010 06:12:54 -0700

Hi Sergio,

> Thanks for your comments. It appears that the launch error has not
> shown up on any Tesla card so far. I wonder if anyone could run with
> ECC turned off in a Tesla card to determine whether that is the issue.

It's doubtful it is ECC. See my other post to Sasha regarding running
C2050's with ECC turned off to get better performance.

> If it's ECC, then one could explain the behavior of different cards.
> Here at SFSU there is another system using a GTX240 card that has not
> seen the error for 18,000 atoms. My GTX470 has only seen the error on
> systems larger than 60,000 atoms (within 1 ns), but not at 25,000 atoms
> (for 20 ns, multiple times).

Have you got another GTX470 available that you could test on and see if it
is reproducible on identical cards? It is possible, say, that your graphics
memory is marginal in terms of reliability and perhaps that error lies in
the high end of the address space. That would explain why smaller systems
can run fine. This is pure speculation though. If you see the error on
multiple cards then it is much more likely to be a bug in the code.

I trust the card you have is NOT overclocked on the GPU core or the memory.

> Sasha's GTX480 card sees it within a few
> ns with 20,000 atoms. This variability seems to indicate hardware
> issues in the card, or environmental factors that cause a memory error.
> My system is running in an office with heavy concrete walls on 5 of six
> sides, giving some shielding from cosmic rays that could cause a bit
> error in memory. Sasha: is yours more exposed?

I still subscribe to Seymour Cray's quote regarding this: "Parity is for
Farmers."

> If it's not this, then hardware quality could be a factor. As Scott
> mentioned, the quality control is not as strict for consumer cards as
> it is for Tesla cards.

Indeed. The other issue might be if you have a card from someone like BFG
who sell overclocked versions etc. I would generally trust the PNY cards
more. If you can, though, you may just want to RMA the card and let them send
you a new one and see if it has the same issues.

> I verify my trajectories by doing ensemble
> average hydrodynamic computations to compare with experiment for the
> proteins that I study. I have yet to verify the "good" trajectories
> produced with cuda. I have computed several with standard
> multiprocessor machines to make the comparison. If ECC is an issue,
> the trajectory quality may not be good either, even if it finishes. I

I would be extremely surprised if any hardware failure or memory error gave
you a trajectory that otherwise looked good. I think if you have anything
dodgy with hardware etc. you will see an unexplained blow-up, SHAKE error or
other major issue.

> will report later on that. If this is the case, then this should put a
> nail in the coffin on the use of consumer cards for doing science.

Before you do that, though, either 1) RMA the card and/or 2) get yourself a
decent 1 kW power supply.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Aug 27 2010 - 06:30:04 PDT