Re: [AMBER] Error: unspecified launch failure launching kernel kReduceForces

From: Aron Broom <broomsday.gmail.com>
Date: Mon, 2 Apr 2012 13:57:26 -0400

Hi Ross,

I see that kind of error VERY RARELY on an M2070, regardless of simulation
size etc. On a GTX570 where the memory tests fine and the power supply is
also good, I see it very rarely at moderate system sizes, but it becomes an
issue at larger sizes (>100,000 atoms, >50% of the available memory), though
maybe there is actually a bad memory block somewhere that I didn't find
with my quick memtests.

In terms of the GTX580 I've been using, it was failing memory tests
constantly, and as you say the power supply in that case is below the
recommended wattage for that card, and so it isn't surprising that I see
that error extremely often in that case.

These were all with the latest bug-fixes applied. I guess my point was
that even if the card is good (the M2070 case) this still happens from time
to time, certainly more frequently than on the CPU, and I think maybe it's
just good practice to do a quick search of your restart files to make sure
nothing like this happened.
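
A minimal sketch of that kind of check, in Python. It assumes the restart
files are in the plain-text (ASCII) format rather than NetCDF, and that the
first line is a free-form title that should be skipped. The function names
here are made up for illustration and are not part of AMBER. Besides 'NaN',
it also flags 'Inf' and runs of asterisks, which is how fixed-width
formatted output usually shows a value that overflowed its field.

```python
import re

# Tokens that suggest a corrupted restart file: NaN, Inf, or a run of
# asterisks left by a number that did not fit its fixed-width field.
BAD_VALUE = re.compile(r"nan|inf|\*{3,}", re.IGNORECASE)


def find_bad_lines(lines):
    """Return (line_number, text) pairs for lines that look corrupted.

    Line 1 is skipped on the assumption that it is a free-form title,
    which could legitimately contain letters such as "inf".
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            continue
        if BAD_VALUE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_restart(path):
    """Scan one ASCII restart file and return its suspicious lines."""
    with open(path, "r", errors="replace") as handle:
        return find_bad_lines(handle)
```

Calling scan_restart() on each restart file in a run directory and printing
whatever comes back is enough to spot a run that silently went bad. An empty
list means nothing suspicious was found in that file.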

In terms of the power supply causing these kinds of problems, would you also
see memory tests failing because of that? I'd like to troubleshoot that
particular card, and if the card itself is fine and it just needs a beefier
supply that would be a fantastically easy fix.

~Aron

On Mon, Apr 2, 2012 at 12:56 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Aron,
>
> > As one more thing to add, AMBER being run on a GPU, particularly the GTX
> > ones, seems to often run into the problem where the coordinates and
> > velocities get lost from one step to another. Maybe you've already done
> > it, but Ross' response made me think that you should search your restart
> > file for any 'NaN' entries.
>
> This is worrying... Are you really seeing this occur on a regular basis?
>
> If you are running the very latest version of the code (bugfix.20) you
> shouldn't see such errors unless you have some kind of hardware issue. I
> would suspect one of the following:
>
> 1) You are using an overclocked (or you overclocked yourself) GTX580.
>
> 2) Your card / computer is overheating.
>
> 3) Your power supply is underspecced for your machine running flat out.
>
> 4) Your GTX card is failing (happens - I've had several go bad, although
> mostly due to fan failures).
>
> You would also see NAN's etc occurring due to issues with your simulation,
> if something is unstable, bad parameters, strained bonds etc etc. These sort
> of errors should show up more frequently on a GTX card than a Tesla card
> though.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
Received on Mon Apr 02 2012 - 11:00:03 PDT