Re: [AMBER] GTX 780SC error "cudaMempcpy GpuBuffer :: Download failed unspecified launch failure" from Jason Swails on 2014-09-18 (Amber Archive Sep 2014)

From: Jason Swails <jason.swails.gmail.com>
Date: Thu, 18 Sep 2014 07:52:26 -0400

On Thu, 2014-09-18 at 10:51 +0200, Dieter Buyst wrote:
> Dear All,
>
> Quite recently we upgraded our MD computer (ubuntu 12.04 LTS, CUDA 5.0
> and driver 340.24) with two EVGA GTX780 SC GPUs since the 780 Ti
> models were not recommended due to the stability issues. After the
> installation I performed the usual tests for running calculations on a
> single and both GPUs at the same time. I did notice in both scenarios
> there were about 30 possible failures, but on inspection of the .diff
> files they were just small errors in the last decimal place. Likewise,
> the benchmark suite produced results which were in line with what can
> be expected for our configuration.

If you are concerned about whether a GPU is good, a useful test is to
run the same simulation (at least 1 ns) several times, with identical
random seeds or, even better, no stochastic thermostat at all, and check
that you get exactly the same trajectories and energies. Some GPUs are
faulty in such a subtle way that the problem only shows up as
differences in ~20-30% of long replicate simulations (this is something
the regression tests cannot detect).
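
A minimal sketch of how such a replicate comparison could be automated,
assuming each run writes an mdout-style text file in which every energy
record has an "NSTEP = ..." line followed by an "Etot = ..." line. The
function names and the exact layout are assumptions for illustration,
not part of Amber. The values are compared as printed strings, since
the test is for exactly identical output.

```python
import re

# Assumed mdout-style layout: "NSTEP = 1000 ..." followed by "Etot = -1234.5678 ...".
STEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
ETOT_RE = re.compile(r"Etot\s*=\s*(\S+)")


def read_etot(text):
    """Return [(step, etot_as_printed), ...] parsed from the output text."""
    records = []
    step = None
    for line in text.splitlines():
        m = STEP_RE.search(line)
        if m:
            step = int(m.group(1))
        m = ETOT_RE.search(line)
        if m and step is not None:
            records.append((step, m.group(1)))
            step = None  # take only the first Etot after each NSTEP line
    return records


def first_divergence(text_a, text_b):
    """Return the first step at which two runs disagree, or None if identical."""
    a = read_etot(text_a)
    b = read_etot(text_b)
    for (step_a, e_a), (step_b, e_b) in zip(a, b):
        if step_a != step_b or e_a != e_b:
            return step_a
    if len(a) != len(b):
        # One run has extra records (for example, the other one died early).
        longer = a if len(a) > len(b) else b
        return longer[min(len(a), len(b))][0]
    return None
```

Usage would be something like first_divergence(open("run1.out").read(),
open("run2.out").read()), where the file names are placeholders. Any
non-None result for runs that used identical inputs on the same GPU is
worth a closer look.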
>
> While experimenting with the Scaled_MD feature now available in
> Amber14, both a colleague and I sometimes run into the error
> "cudaMempcpy GpuBuffer :: Download failed unspecified launch failure".
> This doesn't happen very often but does pop up when we're performing
> some longer runs. I already checked the mailing archive for similar
> problems and it is suggested that probably one of the GPUs is faulty
> and is causing these problems. Now I just wanted to make sure I'm
> right and rule out whether this error could happen due to the nature
> of the scaled MD feature, given it's brand new and possibly not fully
> tested yet?

If I had to guess, your system may be blowing up. See if there is any
way to "zoom in" on the end of the simulation (I presume that it always
dies in exactly the same place if you run two simulations with the
exact same settings on the same GPU, right?). Enhanced sampling
techniques, like aMD and scaled MD, are designed to increase your
sampling of high-energy portions of the free energy landscape, so my gut
feeling is that these simulations are more susceptible to the kinds of
large forces that are liable to blow up a simulation.
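
One way to look at the end of a run is to scan the printed energies for
the first record that looks unphysical. The sketch below assumes the
same mdout-style "NAME = value" layout, treats overflowed fields
(printed as asterisks) and NaN as non-finite, and flags temperatures
above a cutoff. The function name and the 1000 K default are arbitrary
illustrative choices, not Amber settings.

```python
import math
import re

# Assumed field names as they appear in mdout-style energy records.
FIELD_RE = re.compile(r"(NSTEP|TEMP\(K\)|Etot)\s*=\s*(\S+)")


def parse_value(token):
    """Convert a printed field to float; unparsable text (e.g. '*****') becomes NaN."""
    try:
        return float(token)
    except ValueError:
        return math.nan


def first_suspicious_step(text, max_temp=1000.0):
    """Return the first step with a non-finite value or TEMP(K) above max_temp."""
    step = None
    for line in text.splitlines():
        for name, token in FIELD_RE.findall(line):
            if name == "NSTEP":
                if token.isdigit():
                    step = int(token)
                continue
            value = parse_value(token)
            too_hot = name == "TEMP(K)" and value > max_temp
            if not math.isfinite(value) or too_hot:
                return step
    return None
```

If this returns a step, restarting from the last restart file written
before it, with energies and coordinates printed much more often, would
show how the system behaves in the final steps. If it returns None and
the crash point moves between otherwise identical runs, that points
more toward the hardware.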
>
> In addition, I'm wondering if one can still trust the trajectory
> produced during these errors, or whether it's better to just start
> from scratch with a new GPU?

I think it's worth doing some tests to see if it's the GPU or something
else as well. If it's a blowup, your trajectory is likely fine (before
the blowup, obviously). As you've pointed out, this is rather new
functionality so you are playing in largely uncharted waters. We would
be interested in anything you find out here.

Good luck!
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher

Received on Thu Sep 18 2014 - 05:00:05 PDT