Re: [AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 16 Apr 2014 10:27:45 -0700

Hi Keiran,

If driver 337.12 does not fix things then it is almost certainly a bad
GPU. I've seen lots of issues with 780Ti cards since they appear to be
right on the edge of stability with regard to clock speed. I recommend
returning it for exchange.

All the best
Ross


On 4/15/14, 11:14 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:

>Dear Amber users,
>
>My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD)
>for running pmemd.cuda, and it's been consistently printing out "NaN" in
>the TEMP field of an mdcrd.out and reporting "cudaMemcpy
>GpuBuffer::Download failed unspecified launch failure." It will run for a
>little bit, often a couple thousand frames, which look normal when
>visualised in VMD, before suddenly cutting out.
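
One way to pin down where a run like this goes wrong is to scan the output for the first step whose TEMP field is no longer a finite number. The following is only a minimal sketch: it assumes the usual mdout energy-record layout ("NSTEP = ... TEMP(K) = ..."), and the function name first_nan_step and the sample text are hypothetical, not part of Amber.

```python
import math
import re

# Matches the header line of an energy record, e.g.
#  NSTEP =     1000   TIME(PS) =       2.000  TEMP(K) =   300.12  PRESS =     0.0
# Assumption: TEMP(K) is printed as a number, as "NaN", or as asterisks on overflow.
RECORD = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")


def first_nan_step(mdout_text):
    """Return the first NSTEP whose TEMP(K) is not a finite number, else None."""
    for line in mdout_text.splitlines():
        match = RECORD.search(line)
        if match is None:
            continue  # not an energy-record header line
        step, temp = int(match.group(1)), match.group(2)
        try:
            value = float(temp)  # float("NaN") parses successfully to nan
        except ValueError:
            return step  # unparseable field such as "********"
        if not math.isfinite(value):
            return step
    return None


# Illustrative sample text only; not taken from a real run.
SAMPLE = """
 NSTEP =     1000   TIME(PS) =       2.000  TEMP(K) =   300.12  PRESS =     0.0
 NSTEP =     2000   TIME(PS) =       4.000  TEMP(K) =      NaN  PRESS =     0.0
"""
```

Running first_nan_step over the text of the output file gives the step number to compare against the last frame that still looks normal in VMD.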
>
>I've managed to find a few threads on this issue, however they seemed
>mostly to be caused by out-of-date Amber builds, and the only one I saw
>about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed to
>be just due to a bad card.
>
>I've made sure my Amber build is up-to-date. "update_amber --update" says
>no new updates and "update_amber -v" gives AmberTools version 13.24,
>Amber version 12.21
>
>This exact same system (and a bunch of analogous ones) has run without
>problems on the pair of Tesla C2050's we have. I also tried running the
>longer ambermd.org tutorial B1 runs with pmemd.cuda and I get the same
>types of crashes.
>
>To my inexperienced eyes the "cudaMemcpy buffer" error seemed to be a
>memory error with the card, so I ran the jobs with cuda-memcheck. However
>every time I do this the job runs fine (albeit very slowly) and with 0
>errors in the summary. When it moves on to the next job without memcheck I
>then get a crash.
>
>Suspecting overheating I monitored the card's temperature with nvidia-smi
>while on full load, and it gets up to 83C with 70% fan which I don't
>think is out of tolerance.
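
For logging the card's temperature, fan speed and SM clock over the course of a run, something along these lines could be used. It is a rough sketch that assumes nvidia-smi is on the PATH and that the installed driver supports the --query-gpu fields shown; the helper names parse_sample, read_gpu and monitor are hypothetical.

```python
import subprocess
import time

# Fields requested from nvidia-smi: GPU temperature (C), fan speed (%), SM clock (MHz).
QUERY = "temperature.gpu,fan.speed,clocks.sm"


def parse_sample(line):
    """Parse one CSV line of nvidia-smi output into (temp_C, fan_percent, sm_clock_MHz).

    Assumption: all three fields are numeric. Some cards report "[Not Supported]"
    for a field, in which case int() raises ValueError.
    """
    temp, fan, clock = (field.strip() for field in line.split(","))
    return int(temp), int(fan), int(clock)


def read_gpu(index=0):
    """Query a single GPU once and return the parsed sample."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_sample(out.splitlines()[0])


def monitor(samples=10, interval=5.0, index=0):
    """Print a timestamped reading every `interval` seconds, `samples` times."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), *read_gpu(index))
        time.sleep(interval)
```

Calling monitor() in a second terminal while pmemd.cuda is running leaves a log that can be lined up against the time of the crash.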
>
>When I run tests with make test.cuda I get 89 passes and 8 fails, and on
>inspection of the .diff file they are just differences in the last decimal
>place. Similarly the benchmark suite passes fine and gives ns/day in line
>with the GPU benchmarks on ambermd.org.
>
>For completeness' sake I've also recompiled Amber with -cuda_SPDP,
>-cuda_DPDP and the Intel compilers and I get the same story.
>
>I'm running CentOS 6.5, and CUDA 6.0/5.5, with nvidia driver version
>v337.12. I installed these by hand as sadly I can't yum install directly
>from the cuda and elrepo repositories, as the cuda repo provides v319.37
>which the Amber website says is incompatible. I previously tried v331.49
>but now that ambermd.org/gpu says GTX780Ti is supported I updated the
>drivers, and still get the issue.
>
>My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P)
>seems to only support PCI-E 1.x. Could this be the cause of the issue?
>
>I'm now at a bit of a loss as to what the issue is. Is there something I
>am missing, or is this just a bad card?
>
>Gratefully,
>
>Keiran Rowell
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Wed Apr 16 2014 - 11:00:03 PDT