Re: [AMBER] Error: unspecified launch failure launching kernel kReduceForces

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 11 Apr 2012 08:57:43 -0700

If you have a card that fails memory tests, just stop using it until you
can make it pass them or replace it. An insufficient power supply is
cause enough for this. Vijay Pande has had a world of trouble from
F@H fans who do this and produce garbage results on both AMD and NVIDIA
graphics cards. It's every bit as bad as not sufficiently cooling the CPU.

Also, when you buy GeForce cards, you're getting an amazing bargain, but in
exchange, you're literally pushing them far beyond their tested limits when
you run on 100,000+ atom systems. Games and apps simply do not crank the
GPU up to 11 without end for more than a 20th of a second or so without a
break for vertical blank. Tesla cards are clocked down and binned so they
are both warrantied for this stuff and likely capable of handling it. If
you're really serious about this, try looking at all the hoops the
overclocking community goes through to max out GPU performance. That'd
even be a great article somewhere if you ask me (not that anyone is).

Scott




On Mon, Apr 2, 2012 at 10:57 AM, Aron Broom <broomsday.gmail.com> wrote:

> Hi Ross,
>
> I see that kind of error VERY RARELY on an M2070 regardless of simulation
> size etc. On a GTX570 where the memory tests good and the power supply is
> also good, I see it very rarely at moderate system sizes, but it becomes an
> issue at larger sizes (>100,000 atoms, >50% of the available memory, but
> maybe there is actually a bad memory block somewhere that I didn't find
> with my quick memtests).
>
> In terms of the GTX580 I've been using, it was failing memory tests
> constantly, and as you say the power supply in that case is below the
> recommended wattage for that card, and so it isn't surprising that I see
> that error extremely often in that case.
>
> These were all with the latest bug-fixes applied. I guess my point was
> that even if the card is good (the M2070 case) this still happens from time
> to time, certainly more frequently than on the CPU, and I think maybe it's
> just good practice to do a quick search of your restart files to make sure
> nothing like this happened.
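
A quick search of that kind could be sketched in Python as below. This is
only a sketch under some assumptions of mine, not anything from the AMBER
distribution: it assumes ASCII-format restart files whose first line is a
free-text title, and the names find_bad_lines and scan_file are
hypothetical. It flags any later line containing NaN, or a run of
asterisks, which is what Fortran formatted output prints when a value
overflows its field.

```python
import re
import sys

# Matches "NaN" in any letter case, or a run of asterisks, which Fortran
# formatted output prints when a number does not fit its fixed-width field.
BAD_VALUE = re.compile(r"nan|\*+", re.IGNORECASE)


def find_bad_lines(lines, skip_header=1):
    """Return (line_number, text) pairs for lines holding NaN or overflow.

    skip_header is an assumption: the first line of an ASCII restart file
    is taken to be a free-text title, which may legitimately contain "nan".
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if number <= skip_header:
            continue
        if BAD_VALUE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one text restart file and return its suspicious lines."""
    with open(path, "r", errors="replace") as handle:
        return find_bad_lines(handle)


def main(paths):
    """Print every suspicious line as path:line_number: text."""
    for path in paths:
        for number, text in scan_file(path):
            print(f"{path}:{number}: {text}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under some name of your choosing, it would be run with the restart
files as arguments. Binary (e.g. NetCDF) restart files would need a
different check.
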
>
> In terms of the power supply causing this kind of problem, would you also
> see memory tests failing because of that? I'd like to troubleshoot that
> particular card, and if the card itself is fine and it just needs a beefier
> supply that would be a fantastically easy fix.
>
> ~Aron
>
> On Mon, Apr 2, 2012 at 12:56 PM, Ross Walker <ross.rosswalker.co.uk>
> wrote:
>
> > Hi Aron,
> >
> > > As one more thing to add, AMBER being run on a GPU, particularly the
> > > GTX ones, seems to often run into the problem where the coordinates and
> > > velocities get lost from one step to another. Maybe you've already done
> > > it, but Ross' response made me think that you should search your restart
> > > file for any 'NaN' entries.
> >
> > This is worrying... Are you really seeing this occur on a regular basis?
> >
> > If you are running the very latest version of the code (bugfix.20), you
> > shouldn't see such errors unless you have some kind of hardware issue. I
> > would suspect one of the following:
> >
> > 1) You are using an overclocked GTX580 (or one you overclocked yourself).
> >
> > 2) Your card / computer is overheating.
> >
> > 3) Your power supply is underspecced for your machine running flat out.
> >
> > 4) Your GTX card is failing (happens - I've had several go bad, although
> > mostly due to fan failures).
> >
> > You would also see NaNs etc. occurring due to issues with your simulation,
> > if something is unstable, bad parameters, strained bonds etc. These sorts
> > of errors should show up more frequently on a GTX card than a Tesla card
> > though.
> >
> > All the best
> > Ross
> >
> > /\
> > \/
> > |\oss Walker
> >
> > ---------------------------------------------------------
> > | Assistant Research Professor |
> > | San Diego Supercomputer Center |
> > | Adjunct Assistant Professor |
> > | Dept. of Chemistry and Biochemistry |
> > | University of California San Diego |
> > | NVIDIA Fellow |
> > | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> > | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> > ---------------------------------------------------------
> >
> > Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> > be read every day, and should not be used for urgent or sensitive issues.
> >
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
> >
>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
Received on Wed Apr 11 2012 - 09:00:04 PDT