Dear Amber users,
My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD) for running pmemd.cuda, and it has been consistently printing "NaN" in the TEMP field of the mdcrd.out and reporting "cudaMemcpy GpuBuffer::Download failed unspecified launch failure." Each run proceeds for a while, often a couple of thousand frames that look normal when visualised in VMD, before suddenly cutting out.
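For anyone wanting to pin down where a run goes bad, here is a minimal sketch of how the first NaN record could be located in the output. It assumes the usual energy-record layout with "NSTEP =" and "TEMP(K) =" on the same line. The function name and the sample records are made up for illustration and are not taken from my actual output.

```python
import re

# Matches an energy-record header line such as (assumed layout):
#  NSTEP =     2000   TIME(PS) =       4.000  TEMP(K) =   301.52  PRESS =     0.0
RECORD = re.compile(r"NSTEP\s*=\s*(\d+).*?TEMP\(K\)\s*=\s*(\S+)")

def first_nan_step(lines):
    """Return the NSTEP of the first record whose TEMP(K) is NaN, or None."""
    for line in lines:
        m = RECORD.search(line)
        if m and m.group(2).lower() == "nan":
            return int(m.group(1))
    return None

# Hypothetical sample records, for illustration only.
sample = [
    " NSTEP =     1000   TIME(PS) =       2.000  TEMP(K) =   299.87  PRESS =     0.0",
    " NSTEP =     2000   TIME(PS) =       4.000  TEMP(K) =      NaN  PRESS =     NaN",
]
print(first_nan_step(sample))  # prints 2000
```

On a real file the same function can be fed an open file handle, since it only iterates over lines.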
I've managed to find a few threads on this issue, but they seemed mostly to be caused by out-of-date Amber builds, and the only one I saw about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed to be due to a bad card.
I've made sure my Amber build is up to date: "update_amber --update" reports no new updates, and "update_amber -v" gives AmberTools version 13.24 and Amber version 12.21.
This exact same system (and a bunch of analogous ones) has run without problems on the pair of Tesla C2050s we have. I also tried running the longer ambermd.org tutorial B1 runs with pmemd.cuda and got the same kind of crash.
To my inexperienced eyes the "cudaMemcpy buffer" error looked like a memory error with the card, so I ran the jobs under cuda-memcheck. However, every time I do this the job runs fine (albeit very slowly) and reports 0 errors in the summary. When the script moves on to the next job without memcheck, I get a crash again.
Suspecting overheating, I monitored the card's temperature with nvidia-smi while under full load. It gets up to 83C with the fan at 70%, which I don't think is out of tolerance.
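A minimal sketch of how such readings could be collected from a script rather than by eye. It assumes the installed nvidia-smi accepts "--query-gpu" with the temperature.gpu and fan.speed fields and prints plain integers for both, which may not hold for every card and driver. The helper names parse_reading and read_gpu are made up for illustration.

```python
import subprocess

# Assumed query: one CSV line per GPU, e.g. "83, 70"
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"]

def parse_reading(csv_line):
    """Turn a line like '83, 70' into (temperature_C, fan_percent)."""
    temp, fan = (field.strip() for field in csv_line.split(","))
    return int(temp), int(fan)

def read_gpu(index=0):
    """Query one GPU once; needs nvidia-smi on PATH and Python 3.7+."""
    out = subprocess.check_output(QUERY + ["-i", str(index)], text=True)
    return parse_reading(out.splitlines()[0])
```

Calling read_gpu() in a loop with a short sleep while a job is running would give a temperature trace to line up against the time of the crash.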
When I run the tests with "make test.cuda" I get 89 passes and 8 fails, and on inspection of the .diff file the failures are just differences in the last decimal place. Similarly, the benchmark suite passes fine and gives ns/day in line with the GPU benchmarks on ambermd.org.
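A minimal sketch of how "last decimal place" could be checked mechanically for a pair of lines from the .diff file, assuming each pair holds the same number of decimal values in the same order. The function name and the sample values are hypothetical and are not copied from my test output.

```python
import re

# Decimal numbers with an optional sign and exponent, e.g. -58221.3712 or 1.2E-03
NUMBER = re.compile(r"[-+]?\d+\.\d+(?:[eE][-+]?\d+)?")

def max_abs_diff(expected_line, actual_line):
    """Largest absolute difference between matching numbers on two lines."""
    a = [float(x) for x in NUMBER.findall(expected_line)]
    b = [float(x) for x in NUMBER.findall(actual_line)]
    if len(a) != len(b):
        raise ValueError("lines hold different numbers of values")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

# Hypothetical pair of diff lines, for illustration only.
expected = "<  Etot   =    -58221.3712  EKtot   =     14413.4893"
actual   = ">  Etot   =    -58221.3711  EKtot   =     14413.4893"
print(max_abs_diff(expected, actual) < 1e-3)  # prints True
```

A difference on the order of the last printed digit points to rounding noise, whereas anything larger would be worth a closer look.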
For completeness' sake I've also recompiled Amber with -cuda_SPDP, with -cuda_DPD, and with the Intel compilers, and I get the same story each time.
I'm running CentOS 6.5 and CUDA 6.0/5.5, with NVIDIA driver version v337.12. I installed these by hand because, sadly, I can't yum install directly from the cuda and elrepo repositories: the cuda repo provides v319.37, which the Amber website says is incompatible. I previously tried v331.49, but now that ambermd.org/gpu says the GTX780Ti is supported I updated the drivers, and I still get the issue.
My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P) seems to only support PCI-E 1.x. Could this be the cause of the issue?
I'm now at a bit of a loss as to what the issue is. Is there something I am missing, or is this just a bad card?
Gratefully,
Keiran Rowell
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Apr 15 2014 - 23:30:02 PDT