[AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."

From: Keiran Rowell <k.rowell.unsw.edu.au>
Date: Wed, 16 Apr 2014 06:14:58 +0000

Dear Amber users,

My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD) for running pmemd.cuda, and it has been consistently printing "NaN" in the TEMP field of an mdcrd.out and reporting "cudaMemcpy GpuBuffer::Download failed unspecified launch failure." A job will run for a little while, often a couple of thousand frames that look normal when visualised in VMD, before suddenly cutting out.
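To pin down where a run goes bad, a minimal Python sketch along these lines could report the first step whose temperature is NaN. It assumes the output uses the usual energy record layout (a line containing "NSTEP = ..." and "TEMP(K) = ..."); the function name first_nan_step is hypothetical and not part of Amber.

```python
import re

# Assumed layout of an energy record line, e.g.
#  NSTEP =     1000   TIME(PS) =       2.000  TEMP(K) =   300.12  PRESS =     0.0
ENERGY_LINE = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")


def first_nan_step(lines):
    """Return the first NSTEP whose TEMP(K) field is NaN, or None if none is."""
    for line in lines:
        match = ENERGY_LINE.search(line)
        # Accept 'NaN', 'nan' or '-nan' as printed by different builds.
        if match and "nan" in match.group(2).lower():
            return int(match.group(1))
    return None
```

Passing an open file handle for the output file to first_nan_step would give the step number at which the temperature first becomes NaN, which can then be compared across repeated runs.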

I've managed to find a few threads on this issue; however, they mostly seemed to be caused by out-of-date Amber builds, and the only one I saw about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed to be due to a bad card.

I've made sure my Amber build is up to date: "update_amber --update" reports no new updates, and "update_amber -v" gives AmberTools version 13.24 and Amber version 12.21.

This exact same system (and a bunch of analogous ones) has run without problems on the pair of Tesla C2050s we have. I also tried running the longer ambermd.org tutorial B1 runs with pmemd.cuda, and I get the same types of crashes.

To my inexperienced eyes the "cudaMemcpy buffer" error looked like a memory error on the card, so I ran the jobs under cuda-memcheck. However, every time I do this the job runs fine (albeit very slowly) and the summary reports 0 errors. When the script moves on to the next job without memcheck, I get a crash again.

Suspecting overheating, I monitored the card's temperature with nvidia-smi under full load; it gets up to 83C with the fan at 70%, which I don't think is out of tolerance.
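For logging these readings over a whole run rather than watching by eye, a small Python sketch like the following could be used. It relies on nvidia-smi's --query-gpu mode with the temperature.gpu and fan.speed fields; the function names parse_reading and read_gpus are hypothetical helpers for illustration.

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def parse_reading(line):
    """Parse one CSV line such as '83, 70' into (temp_c, fan_percent)."""
    temp, fan = [field.strip() for field in line.split(",")]
    # Some cards report the fan as '[N/A]' or '[Not Supported]'; map that to None.
    fan_percent = int(fan) if fan.isdigit() else None
    return int(temp), fan_percent


def read_gpus():
    """Return one (temp_c, fan_percent) tuple per GPU visible to nvidia-smi."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return [parse_reading(line) for line in output.splitlines() if line.strip()]
```

Calling read_gpus() every few seconds while a job runs, and writing the result to a file with a timestamp, would show whether the temperature or fan speed changes just before a crash.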

When I run the tests with "make test.cuda" I get 89 passes and 8 failures, and on inspection of the .diff file the failures are just differences in the last decimal place. Similarly, the benchmark suite runs fine and gives ns/day in line with the GPU benchmarks on ambermd.org.
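As a way of checking the "last decimal place" claim mechanically instead of by eye, here is a minimal Python sketch. It is not part of the Amber test suite; the function name differs_only_in_last_place and the max_units tolerance are assumptions made for illustration, and it expects finite numbers as printed in the two outputs.

```python
from decimal import Decimal


def differs_only_in_last_place(expected, actual, max_units=1):
    """True if two printed numbers differ by at most max_units in the last printed digit."""
    a, b = Decimal(expected), Decimal(actual)
    # Exponent of the least significant printed digit, e.g. -4 for '1.2345'.
    # Assumes finite values; NaN or Infinity would raise here.
    exponent = max(a.as_tuple().exponent, b.as_tuple().exponent)
    unit = Decimal(1).scaleb(exponent)
    return abs(a - b) <= max_units * unit
```

Applied to each pair of numbers that the .diff file flags, this separates harmless rounding differences from genuinely different results.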

For completeness' sake I've also recompiled Amber with -cuda_SPDP, -cuda_DPDP and the Intel compilers, and I get the same story.

I'm running CentOS 6.5 and CUDA 6.0/5.5, with NVIDIA driver version 337.12. I installed these by hand because, sadly, I can't yum install directly from the cuda and elrepo repositories: the cuda repo provides v319.37, which the Amber website says is incompatible. I previously tried v331.49, but now that ambermd.org/gpu says the GTX780Ti is supported I have updated the drivers, and I still get the issue.

My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P) seems to only support PCI-E 1.x. Could this be the cause of the issue?

I'm now at a bit of a loss as to what the issue is. Is there something I am missing, or is this just a bad card?

Gratefully,

Keiran Rowell

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Apr 15 2014 - 23:30:02 PDT