Re: [AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."

From: Scott Le Grand <varelse2005.gmail.com>
Date: Mon, 2 Jun 2014 17:05:22 -0700

Just say no to GTX 780TI...


What's this one clocked at?



On Mon, Jun 2, 2014 at 4:54 PM, Keiran Rowell <k.rowell.unsw.edu.au> wrote:

> Hi Ross,
>
> Tried both switching to CUDA 5.0 and upgrading to NVIDIA beta driver
> 337.19, and no luck.
>
> I ran the GPU validation test and, in short, it didn't manage to get to
> 500,000 steps on most runs; a cudaMemcpy error occurred before then. The
> one time it did manage to grep a final value I got:
>
> Etot = -58258.6338  EKtot = 14429.1406  EPtot = -72687.7744
>
> which is a fair bit off the value given in the README file.
>
> On other runs I also sometimes get "ERROR: Calculation halted. Periodic
> box dimensions have changed too much from their initial values."
>
> Anyway, here's a Dropbox link if you want to check the output:
>
> https://www.dropbox.com/sh/m81nwfqsigbncq0/AABdY9XaW51hwRbkDnrD2GhPa
>
> We were planning to get a new motherboard anyway, so I'll upgrade and let
> you know how that goes. I'll also get in touch with the supplier about
> replacing for a card which is more reliable at stock speeds.
>
> Thank you,
>
> Keiran
>
> ________________________________________
> From: Ross Walker [ross.rosswalker.co.uk]
> Sent: 31 May 2014 01:42
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
> GpuBuffer::Download failed unspecified launch failure."
>
> Hi Keiran,
>
> In the absence of a hardware issue, 'Invalid write' errors and the like do
> indeed imply a software error. However, if the hardware is faulty then the
> instruction stack may be getting corrupted, in which case all bets are off.
> It's hard to know what is going on here but I still suspect a bad GPU. The
> 780Tis have been awful in terms of reliability. I think I'll update the
> AMBER website to make it clear that these are not recommended. If you look
> at the performance they come in faster than even the GTX-Titan-Black and
> yet are lower-binned silicon. Crazy overclocking. :-( Here are a few things
> I would try, just to see if they help.
>
> Upgrade to the very latest driver - it is 337.19 I believe. Switch to CUDA
> 5.0 (this is what we recommend and test with), update your motherboard
> BIOS, do a clean power-off reboot and then see what happens. Also try the
> GPU Validation suite I built here:
> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
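>
> (For what it's worth, the suite essentially re-runs the same short,
> fixed-length simulation several times and greps the final energies so they
> can be compared against the value in the README. A rough sketch of that
> idea in shell - with hypothetical input file names, not the actual suite
> script - would be:)
>
>   # Repeat an identical fixed-length run and compare the final energies;
>   # run-to-run disagreement on identical inputs points at flaky hardware.
>   for i in 1 2 3; do
>     $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
>         -o run_$i.out -r run_$i.rst
>     grep 'Etot' run_$i.out | tail -n 3   # final averages / fluctuations
>   done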
>
>
> All the best
> Ross
>
> On 5/29/14, 10:43 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
>
> >Hi Ross and Filip,
> >
> >Apologies for not thanking you for your help earlier; we sent the card in
> >for warranty and I wanted to wait for results before reviving the thread,
> >and the process ended up taking a while.
> >
> >I didn't manage to get coolbits working for our old card unfortunately,
> >so I can't tell you what downclocking would have done. They approved the
> >replacement, and we requested a standard 780 but they just did a direct
> >exchange with the same model.
> >
> >The downside is the new GTX780Ti is still having errors; on the plus side,
> >I'm getting more informative error reports.
> >
> >My systems (which run fine on our Tesla C2050s) are still crashing almost
> >immediately. However, the tutorial 1B explicit water simulation ran for
> >10 ns without any errors reported at all. This didn't work before, so...
> >progress, but I'm not sure whether that's down to the hardware or the
> >updated Amber.
> >
> >The error which is being reported to stdout for my systems is "cudaMemcpy
> >GpuBuffer::Download failed an illegal memory access was encountered" or
> >"cudaMemcpy GpuBuffer::Download failed unspecified launch failure"
> >
> >So I submitted jobs with cuda-memcheck and got "Invalid __global__ ...
> >Address xxxx is out of bounds" type errors, i.e. memory access errors of
> >the form listed in section 3.4 here:
> >http://docs.nvidia.com/cuda/cuda-memcheck/#axzz32zciX8kj
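> >
> >(For reference, the invocation was just cuda-memcheck wrapped around the
> >normal pmemd.cuda command line, along the lines of the following - the
> >input and output file names here are placeholders:)
> >
> >  cuda-memcheck $AMBERHOME/bin/pmemd.cuda -O -i md.in -p system.prmtop \
> >      -c system.inpcrd -o md.out -r md.rst -x md.nc > memcheck.log 2>&1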
> >
> >My computer science is rusty, but does this indicate a software rather
> >than a hardware issue? Have I got something wrong in my set-up? I've
> >pasted excerpts of a cuda-memcheck error at the bottom of this message,
> >and attached the nohup.out.
> >
> >I've updated and am now running AmberTools v14.02, Amber v14.00, NVIDIA
> >driver v337.12 (beta), and CUDA v6.0.1. pmemd.cuda was compiled with
> >./configure -cuda gnu. The OS is CentOS 6.5.
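> >
> >(For completeness, the build and test sequence was essentially the
> >standard one - roughly:)
> >
> >  cd $AMBERHOME
> >  ./configure -cuda gnu    # SPFP precision model (the default)
> >  make install
> >  make test.cuda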
> >
> >Thank you heaps!
> >
> >Keiran
> >
> >________________________________________
> >From: filip fratev [filipfratev.yahoo.com]
> >Sent: 18 April 2014 02:14
> >To: AMBER Mailing List
> >Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
> >GpuBuffer::Download failed unspecified launch failure."
> >
> >Hi Keiran and Ross,
> >>I've seen lots of issues with 780Ti since they appear to be right on
> >>the edge of stability with regards to clock speed.
> >
> >In fact this was exactly my case, but mine was the SC version. You could
> >also verify that if you use the 337 driver and downclock your GPU. BTW,
> >337 is the first driver with this option available.
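> >
> >(In case it is useful, downclocking under Linux generally means enabling
> >the "Coolbits" option for the NVIDIA X driver and then applying a negative
> >clock offset with nvidia-settings. The exact attribute names can vary with
> >driver and card, so treat the following as a sketch rather than a recipe:)
> >
> >  # /etc/X11/xorg.conf, inside the "Device" section for the GPU:
> >  Option "Coolbits" "8"
> >
> >  # then, with X running, apply e.g. a -100 MHz graphics clock offset:
> >  nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=-100'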
> >
> >
> >Regards,
> >Filip
> >
> >On Wednesday, April 16, 2014 8:45 PM, Ross Walker <ross.rosswalker.co.uk>
> >wrote:
> >
> >Hi Keiran,
> >
> >If driver 337.12 does not fix things then it is almost certainly a bad
> >GPU. I've seen lots of issues with 780Ti cards since they appear to be
> >right on the edge of stability with regard to clock speed. I recommend
> >returning it for exchange.
> >
> >All the best
> >Ross
> >
> >
> >On 4/15/14, 11:14 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
> >
> >>Dear Amber users,
> >>
> >>My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD)
> >>for running pmemd.cuda, and it has been consistently printing "NaN" in
> >>the TEMP field of an mdcrd.out and reporting "cudaMemcpy
> >>GpuBuffer::Download failed unspecified launch failure." It will run for a
> >>little bit, often a couple of thousand frames, which look normal when
> >>visualised in VMD, before suddenly cutting out.
> >>
> >>I've managed to find a few threads on this issue, however they seemed
> >>mostly to be caused by out-of-date Amber builds, and the only one I saw
> >>about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed to
> >>be just due to a bad card.
> >>
> >>I've made sure my Amber build is up to date. "update_amber --update" says
> >>there are no new updates, and "update_amber -v" gives AmberTools version
> >>13.24, Amber version 12.21.
> >>
> >>This exact same system (and a bunch of analogous ones) has run without
> >>problems on the pair of Tesla C2050s we have. I also tried running the
> >>longer ambermd.org tutorial B1 runs with pmemd.cuda, and I get the same
> >>types of crashes.
> >>
> >>To my inexperienced eyes the "cudaMemcpy buffer" error seemed to be a
> >>memory error with the card, so I ran the jobs with cuda-memcheck. However,
> >>every time I do this the job runs fine (albeit very slowly) and with 0
> >>errors in the summary. When it moves on to the next job without memcheck
> >>I then get a crash.
> >>
> >>Suspecting overheating, I monitored the card's temperature with nvidia-smi
> >>while under full load, and it gets up to 83C with the fan at 70%, which I
> >>don't think is out of tolerance.
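> >>
> >>(The monitoring itself was just nvidia-smi polled in a loop; something
> >>along these lines, where the exact query fields are only an example:)
> >>
> >>  # log temperature, fan speed and clocks every 10 s while the job runs
> >>  nvidia-smi --query-gpu=timestamp,temperature.gpu,fan.speed,clocks.sm \
> >>      --format=csv -l 10 >> gpu_monitor.csv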
> >>
> >>When I run tests with make test.cuda I get 89 passes and 8 failures, and
> >>on inspection of the .diff files they are just errors in the last decimal
> >>place. Similarly, the benchmark suite passes fine and gives ns/day in line
> >>with the GPU benchmarks on ambermd.org.
> >>
> >>For completeness' sake I've also recompiled Amber with -cuda_SPDP,
> >>-cuda_DPDP and the Intel compilers, and I get the same story.
> >>
> >>I'm running CentOS 6.5 and CUDA 6.0/5.5, with NVIDIA driver version
> >>337.12. I installed these by hand, as sadly I can't yum install directly
> >>from the cuda and elrepo repositories; the cuda repo provides v319.37,
> >>which the Amber website says is incompatible. I previously tried v331.49,
> >>but now that ambermd.org/gpu says the GTX780Ti is supported I updated the
> >>drivers, and I still get the issue.
> >>
> >>My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P)
> >>seems to only support PCI-E 1.x. Could this be the cause of the issue?
> >>
> >>I'm now at a bit of a loss as to what the issue is. Is there something I
> >>am missing, or is this just a bad card?
> >>
> >>Gratefully,
> >>
> >>Keiran Rowell
> >>
>
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 02 2014 - 17:30:01 PDT