Re: [AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."

From: filip fratev <filipfratev.yahoo.com>
Date: Thu, 17 Apr 2014 09:14:31 -0700 (PDT)

Hi Keiran and Ross,
>I've seen lots of issues with 780Ti since they appear to be right on
>the edge of stability with regards to clock speed.

In fact this was exactly my case, although mine was the SC version. You can also confirm this by using the 337 driver and downclocking your GPU. BTW, 337 is the first driver with this option available.
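
If you try the downclock, it is worth checking that the lower clock actually took effect and watching the temperature while pmemd.cuda runs. Below is a minimal sketch of how that could be done, not a tested tool. It assumes nvidia-smi is on the PATH and that the driver reports the temperature.gpu, fan.speed and clocks.sm query fields for this card; GeForce cards do not always report every field, so unsupported values are returned as None. The function names and dictionary keys are my own.

```python
import subprocess
import time

# Fields requested from nvidia-smi; availability varies by card and driver.
QUERY_FIELDS = "temperature.gpu,fan.speed,clocks.sm"


def parse_gpu_sample(line):
    """Parse one CSV line such as '83, 70, 1019' into a dict.

    Values that nvidia-smi reports as unsupported become None.
    """
    keys = ("temp_c", "fan_pct", "sm_mhz")
    values = []
    for field in line.strip().split(","):
        field = field.strip()
        values.append(int(field) if field.isdigit() else None)
    return dict(zip(keys, values))


def query_gpu(index=0):
    """Run nvidia-smi once for the GPU at `index` and return a parsed sample."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_sample(out.splitlines()[0])


def watch(index=0, interval_s=5, samples=12):
    """Print a few samples while a job is running on the GPU."""
    for _ in range(samples):
        print(query_gpu(index))
        time.sleep(interval_s)
```

Calling watch() in a second terminal during a run shows whether the SM clock stays at the reduced value under load.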


Regards,
Filip

On Wednesday, April 16, 2014 8:45 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
 
Hi Keiran,

If driver 337.12 does not fix things then it is almost certainly a bad
GPU. I've seen lots of issues with 780Ti since they appear to be right on
the edge of stability with regards to clock speed. I recommend returning
it for exchange.
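
Before sending the card back, it may help to record where each run first goes bad. As far as I know, pmemd.cuda is meant to give identical results when the same input is rerun on the same GPU with the same random seed, so two such runs that fail at different steps would point towards the hardware. The sketch below is only an illustration: it assumes the usual mdout energy lines containing "NSTEP =" and "TEMP(K) =", and the function names are hypothetical.

```python
import re

# Assumed mdout energy line, for example:
#  NSTEP =     1000   TIME(PS) =       2.000  TEMP(K) =   300.12  PRESS =     0.0
STEP_LINE = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")


def first_bad_step(lines):
    """Return the first NSTEP whose TEMP(K) is NaN or overflowed, else None."""
    for line in lines:
        match = STEP_LINE.search(line)
        if not match:
            continue
        step, temp = int(match.group(1)), match.group(2)
        # Fortran prints asterisks when a value does not fit the field width.
        if temp.lower() == "nan" or temp.startswith("*"):
            return step
    return None


def compare_runs(path_a, path_b):
    """Return the first bad step of two mdout files as a (step_a, step_b) pair."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return first_bad_step(file_a), first_bad_step(file_b)
```

If compare_runs() returns two different steps for otherwise identical runs, that is worth mentioning in the exchange request.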

All the best
Ross


On 4/15/14, 11:14 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:

>Dear Amber users,
>
>My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD)
>for running pmemd.cuda, and it's been consistently printing out "NaN" in
>the TEMP field of an mdcrd.out and reporting "cudaMemcpy
>GpuBuffer::Download failed unspecified launch failure." It will run for a
>little bit, often a couple thousand frames, which look normal when
>visualised in VMD, before suddenly cutting out.
>
>I've managed to find a few threads on this issue, however they seemed
>mostly to be caused by out-of-date Amber builds, and the only one I saw
>about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed to
>be just due to a bad card.
>
>I've made sure my Amber build is up-to-date. "update_amber --update" says
>no new updates and "update_amber -v" gives AmberTools version 13.24,
>Amber version 12.21
>
>This exact same system (and a bunch of analogous ones) has run without
>problems on the pair of Tesla C2050s we have. I also tried running the
>longer ambermd.org tutorial B1 runs with pmemd.cuda and I get the same
>types of crashes.
>
>To my inexperienced eyes the "cudaMemcpy buffer" error seemed to be a
>memory error with the card, so I ran the jobs with cuda-memcheck. However,
>every time I do this the job runs fine (albeit very slowly) and with 0
>errors in the summary. When it moves on to the next job without memcheck I
>then get a crash.
>
>Suspecting overheating, I monitored the card's temperature with nvidia-smi
>while on full load, and it gets up to 83C with 70% fan, which I don't
>think is out of tolerance.
>
>When I run the tests with make test.cuda I get 89 passes and 8 fails, and on
>inspection of the .diff file they are just differences in the last decimal
>place. Similarly, the benchmark suite passes fine and gives ns/day in line
>with the GPU benchmarks on ambermd.org.
>
>For completeness' sake I've also recompiled Amber with -cuda_SPDP,
>-cuda_DPD and the Intel compilers and I get the same story.
>
>I'm running CentOS 6.5, and CUDA 6.0/5.5, with nvidia driver version
>v337.12. I installed these by hand as sadly I can't yum install directly
>from the cuda and elrepo repositories, as the cuda repo provides v319.37
>which the Amber website says is incompatible. I previously tried v331.49
>but now that ambermd.org/gpu says GTX780Ti is supported I updated the
>drivers, and still get the issue.
>
>My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P)
>seems to only support PCI-E 1.x. Could this be the cause of the issue?
>
>I'm now at a bit of a loss as to what the issue is. Is there something I am
>missing, or is this just a bad card?
>
>Gratefully,
>
>Keiran Rowell
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Apr 17 2014 - 09:30:02 PDT