Re: [AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 12 Jun 2014 19:10:20 -0700

Hi Keiran,

You are flogging the proverbial dead horse here. That card is defective;
the only fix is to send it back and get it replaced.

I have been seeing so many errors with 780TIs that I am going to remove
them from the benchmark page and officially list them as not recommended.
I'd stick with regular GTX780 or GTX-Titan-Black.

All the best
Ross


On 6/12/14, 6:40 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:

>Dear all,
>
>We've upgraded to a Z87 chipset motherboard and now the 780Ti is in a
>PCI-E 3.0 x16 slot, but I still encountered the same errors. However
>"Coolbits" "12" now seems to be working, so I can change the clock speed
>in nvidia-settings.
>
>Changing the clock by an offset of -105MHz (the min allowed in the GUI)
>gave a Graphics Clock of 1033MHz. When running the GPU_validation_Test at
>this speed 17/20 runs got to completion, but none of them had the same
>value, and no memcpy error was reported. Each crash reported "ERROR:
>Calculation halted. Periodic box dimensions have changed too much from
>their initial values".
>
>I found here: (http://wiki.etechnik-rieke.de/index.php/NVidia_PowerMizer)
>how to force it onto the lowest performance level, which nvidia-settings
>reports as 549MHz. Running with this I don't get any crashes; however,
>only 7/20 values printed agree with each other, indicating the arithmetic
>is still inconsistent.
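
The "7/20 agree" count above can be tallied mechanically. What follows is only a minimal sketch, assuming each run's final value has already been collected as the string it printed (for example with grep); the function names are hypothetical and are not part of the validation suite.

```python
from collections import Counter


def summarize_agreement(final_values):
    """Return (most_common_value, runs_that_printed_it, total_runs).

    Values are compared as the exact strings each run printed, so a
    difference in the last printed digit counts as a disagreement.
    """
    if not final_values:
        raise ValueError("no final values supplied")
    counts = Counter(value.strip() for value in final_values)
    value, hits = counts.most_common(1)[0]
    return value, hits, len(final_values)


def is_reproducible(final_values):
    """True only if every run printed exactly the same final value."""
    _, hits, total = summarize_agreement(final_values)
    return hits == total
```

Comparing strings rather than floats with a tolerance is deliberate here: repeated runs of the same input on a healthy card are expected to print identical values, so any tolerance would hide the symptom being looked for.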
>
>I've found another thread where they have issues with the exact same
>model: (www.gpugrid.net/forum_thread.php?id=3584) and they solved their
>problem by reducing their memory transfer rate to 2700MHz (the default is
>7000MHz). However I can't seem to lower the memory transfer rate either
>through the GUI or command-line.
>
>"nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[x]=-y"
>gives the error:
>ERROR: Error assigning value -10 to attribute
>'GPUMemoryTransferRateOffset'
> (Ashley:0[gpu:0]) as specified in assignment
> '[gpu:0]/GPUMemoryTransferRateOffset[3]=-10' (Unknown Error).
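
For readers unsure what the [x] and y placeholders stand for, here is a small sketch that only formats the assignment string in the shape shown in the error message above (GPU index, attribute name, performance level in brackets, signed offset). The helper names are hypothetical, and building the string correctly does not mean the driver will accept the value; in this thread it did not.

```python
def offset_assignment(attribute, offset, gpu=0, perf_level=None):
    """Build an nvidia-settings assignment such as
    '[gpu:0]/GPUMemoryTransferRateOffset[3]=-10'."""
    target = "[gpu:%d]/%s" % (gpu, attribute)
    if perf_level is not None:
        # The bracketed index selects the performance level.
        target += "[%d]" % perf_level
    return "%s=%d" % (target, offset)


def nvidia_settings_command(attribute, offset, gpu=0, perf_level=None):
    """Return the argument list for subprocess; nothing is executed here."""
    return ["nvidia-settings", "--assign",
            offset_assignment(attribute, offset, gpu, perf_level)]
```

Returning an argument list instead of running the command keeps the sketch safe to try on a machine without a GPU; passing it to subprocess.run is left to the caller.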
>
>I've found someone else with the same issue on Linux:
>https://foldingforum.org/viewtopic.php?f=38&t=26207&sid=bf35e17ee5e122071a2112028565d4e0&start=30
>
>So underclocking appears to have made the card more stable, but it still
>appears unreliable. I think I should contact Gigabyte directly and see if
>they will authorise an exchange for a reliable card.
>
>I'm not sure if I've been any more informative on the problem.
>
>Thanks everyone for your help,
>
>Keiran
>
>________________________________________
>From: Scott Le Grand [varelse2005.gmail.com]
>Sent: 03 June 2014 15:45
>To: AMBER Mailing List
>Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
>GpuBuffer::Download failed unspecified launch failure."
>
>So the 780 TI has 2880 cores just like Titan Black.
>
>This explains why when they work they keep right up with it, and beat it
>with overclocking.
>
>But they are definitely flaky (unlike GTX 780). I am curious how stable
>they'd be at Titan Black clocks (900-980 MHz)...
>
>But otherwise, I'd just buy Titan Blacks.
>
>
>
>On Mon, Jun 2, 2014 at 6:34 PM, Keiran Rowell <k.rowell.unsw.edu.au>
>wrote:
>
>> CUDA's deviceQuery says 1084MHz.
>>
>> I would try downclocking it, but adding "Coolbits" "8" (or "12") to my
>> xorg.conf doesn't seem to work: the performance menu becomes enabled for
>> the 8600GT also in the box, but not the 780Ti.
>>
>> This is the model for those interested:
>>
>> http://www.gigabyte.com.au/products/product-page.aspx?pid=4839#ov
>>
>> Cheers,
>>
>> Keiran
>> ________________________________________
>> From: Scott Le Grand [varelse2005.gmail.com]
>> Sent: 03 June 2014 10:05
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
>> GpuBuffer::Download failed unspecified launch failure."
>>
>> Just say no to GTX 780TI...
>>
>>
>> What's this one clocked at?
>>
>>
>>
>> On Mon, Jun 2, 2014 at 4:54 PM, Keiran Rowell <k.rowell.unsw.edu.au>
>> wrote:
>>
>> > Hi Ross,
>> >
>> > Tried both switching to CUDA 5.0 and upgrading to NVIDIA beta driver
>> > 337.19, and no luck.
>> >
>> > I ran the GPU validation test and in short it didn't manage to get to
>> > 500,000 steps on most runs, a cudaMemcpy error occurred before then. On
>> > the one time it did manage to grep a final value I got:
>> > Etot = -58258.6338 EKtot = 14429.1406 EPtot = -72687.7744
>> >
>> > Which is a fair bit off the value given in the README file.
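
A short sketch of how a final-energy line like the one above could be checked, assuming the three fields appear as "Name = value" somewhere in the text. The README reference value is not quoted in this thread, so reference_etot is a placeholder the reader has to supply; all function names are hypothetical.

```python
import re

ENERGY_RE = re.compile(r"\b(Etot|EKtot|EPtot)\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_energies(text):
    """Extract Etot, EKtot and EPtot from a block of output text."""
    found = {name: float(value) for name, value in ENERGY_RE.findall(text)}
    missing = {"Etot", "EKtot", "EPtot"} - set(found)
    if missing:
        raise ValueError("missing fields: " + ", ".join(sorted(missing)))
    return found


def is_self_consistent(energies, tol=1e-3):
    """Total energy should equal kinetic plus potential energy."""
    total = energies["EKtot"] + energies["EPtot"]
    return abs(energies["Etot"] - total) <= tol


def matches_reference(energies, reference_etot, tol=1e-4):
    """Compare against a reference value supplied by the caller."""
    return abs(energies["Etot"] - reference_etot) <= tol


line = "Etot = -58258.6338 EKtot = 14429.1406 EPtot = -72687.7744"
print(is_self_consistent(parse_energies(line)))  # prints True
```

Note that the values reported above do add up (EKtot + EPtot equals Etot), so the line is internally consistent; that alone says nothing about whether the trajectory that produced it was computed correctly, which is why the comparison against the reference value matters.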
>> >
>> > On other runs I sometimes get "ERROR: Calculation halted. Periodic
>> > box dimensions have changed too much from their initial values."
>> >
>> > Anyway here's a dropbox link if you wanted to check the output:
>> >
>> > https://www.dropbox.com/sh/m81nwfqsigbncq0/AABdY9XaW51hwRbkDnrD2GhPa
>> >
>> > We were planning to get a new motherboard anyway, so I'll upgrade and
>> > let you know how that goes. I'll also get in touch with the supplier
>> > about replacing it with a card which is more reliable at stock speeds.
>> >
>> > Thank you,
>> >
>> > Keiran
>> >
>> > ________________________________________
>> > From: Ross Walker [ross.rosswalker.co.uk]
>> > Sent: 31 May 2014 01:42
>> > To: AMBER Mailing List
>> > Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
>> > GpuBuffer::Download failed unspecified launch failure."
>> >
>> > Hi Keiran,
>> >
>> > In the absence of a hardware issue, errors such as 'Invalid write' do
>> > indeed imply a software error. However if the hardware is faulty then
>> > the instruction stack may be getting corrupted, in which case all bets
>> > are off. It's hard to know what is going on here but I still suspect a
>> > bad GPU. The 780Ti's have been awful in terms of reliability. I think
>> > I'll update the AMBER website to make it clear that these are not
>> > recommended. If you look at the performance they come in faster than
>> > even the GTX-Titan-Black and yet are lower binned silicon. Crazy
>> > overclocking. :-( A few things I would try just to see if it helps.
>> >
>> > Upgrade to the very latest driver - it is 337.19 I believe. Switch to
>> > CUDA 5.0 (this is what we recommend and test with), update your
>> > motherboard BIOS, do a clean power off reboot and then see what
>> > happens. Try the GPU Validation suite I built here:
>> > https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>> >
>> >
>> > All the best
>> > Ross
>> >
>> > On 5/29/14, 10:43 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
>> >
>> > >Hi Ross and Filip,
>> > >
>> > >Apologies for not thanking you for your help earlier, we sent the
>> > >card in for warranty and I wanted to wait for results before reviving
>> > >the thread and the process ended up taking a while.
>> > >
>> > >I didn't manage to get coolbits working for our old card
>> > >unfortunately, so I can't tell you what downclocking would have done.
>> > >They approved the replacement, and we requested a standard 780 but
>> > >they just did a direct exchange with the same model.
>> > >
>> > >The downside is the new GTX780Ti is still having errors, on the plus
>> > >side I'm getting more informative error reports.
>> > >
>> > >My systems (which run fine on our TeslaC2050s) are still crashing
>> > >almost immediately. However the tutorial 1B explicit water simulation
>> > >ran for 10ns without any errors reported at all. This didn't work
>> > >before so.. progress, but I'm not sure if that's down to hardware or
>> > >updated Amber.
>> > >
>> > >The error which is being reported to stdout for my systems is
>> > >"cudaMemcpy GpuBuffer::Download failed an illegal memory access was
>> > >encountered" or "cudaMemcpy GpuBuffer::Download failed unspecified
>> > >launch failure"
>> > >
>> > >So I submitted jobs with cuda-memcheck and get "Invalid __global__ ...
>> > >Address xxxx is out of bounds" type errors, so memory access errors of
>> > >the form listed in section 3.4 here:
>> > >http://docs.nvidia.com/cuda/cuda-memcheck/#axzz32zciX8kj
>> > >
>> > >My computer science is rusty, but does this indicate a software not
>> > >hardware issue? Have I got something wrong in my set-up? I've pasted
>> > >excerpts of a cuda-memcheck error at the bottom of this message, as
>> > >well as attaching the nohup.out
>> > >
>> > >I've updated and am now running AmberTools v14.02, Amber v14.00,
>> > >Nvidia driver v337.12 (beta), CUDA v6.0.1. pmemd.cuda was compiled
>> > >with ./configure -cuda gnu. OS is CentOS 6.5
>> > >
>> > >Thank you heaps!
>> > >
>> > >Keiran
>> > >
>> > >________________________________________
>> > >From: filip fratev [filipfratev.yahoo.com]
>> > >Sent: 18 April 2014 02:14
>> > >To: AMBER Mailing List
>> > >Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
>> > >GpuBuffer::Download failed unspecified launch failure."
>> > >
>> > >Hi Keiran and Ross,
>> > >>I've seen lots of issues with 780Ti since they appear to be right on
>> > >>the edge of stability with regards to clock speed.
>> > >
>> > >In fact this was exactly my case, but mine was the SC version. You
>> > >might also prove that if you use the 337 driver and downclock your
>> > >GPU. BTW 337 is the first driver with this option available.
>> > >
>> > >
>> > >Regards,
>> > >Filip
>> > >
>> > >On Wednesday, April 16, 2014 8:45 PM, Ross Walker
>> > ><ross.rosswalker.co.uk> wrote:
>> > >
>> > >Hi Keiran,
>> > >
>> > >If driver 337.12 does not fix things then it is almost certainly a
>> > >bad GPU. I've seen lots of issues with 780Ti since they appear to be
>> > >right on the edge of stability with regards to clock speed. I
>> > >recommend returning it for exchange.
>> > >
>> > >All the best
>> > >Ross
>> > >
>> > >
>> > >On 4/15/14, 11:14 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
>> > >
>> > >>Dear Amber users,
>> > >>
>> > >>My research group recently bought a GTX780Ti (model number:
>> > >>GV-N78TOC-3GD) for running pmemd.cuda, and it's been consistently
>> > >>printing out "NaN" in the TEMP field of an mdcrd.out and reporting
>> > >>"cudaMemcpy GpuBuffer::Download failed unspecified launch failure."
>> > >>It will run for a little bit, often a couple thousand frames, which
>> > >>look normal when visualised in VMD, before suddenly cutting out.
>> > >>
>> > >>I've managed to find a few threads on this issue, however they
>> > >>seemed mostly to be caused by out-of-date Amber builds, and the only
>> > >>one I saw about a GTX780Ti
>> > >>(http://archive.ambermd.org/201401/0378.html) seemed to be just due
>> > >>to a bad card.
>> > >>
>> > >>I've made sure my Amber build is up-to-date. "update_amber --update"
>> > >>says no new updates and "update_amber -v" gives AmberTools version
>> > >>13.24, Amber version 12.21
>> > >>
>> > >>This exact same system (and a bunch of analogous ones) has run
>> > >>without problems on the pair of Tesla C2050's we have. I also tried
>> > >>running the longer ambermd.org tutorial B1 runs with pmemd.cuda and I
>> > >>get the same types of crashes.
>> > >>
>> > >>To my inexperienced eyes the "cudaMemcpy buffer" error seemed to be a
>> > >>memory error with the card, so I ran the jobs with cuda-memcheck.
>> > >>However every time I do this the job runs fine (albeit very slowly)
>> > >>and with 0 errors in the summary. When it moves onto the next job
>> > >>without memcheck I then get a crash.
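
When many jobs are run under cuda-memcheck, the summary count can be pulled out of the saved logs rather than read by eye. This is a rough sketch, assuming the log contains the tool's usual "ERROR SUMMARY: N errors" line; the function name is hypothetical.

```python
import re

SUMMARY_RE = re.compile(r"ERROR SUMMARY:\s*(\d+)\s+errors?")


def memcheck_error_count(log_text):
    """Return the count from the last ERROR SUMMARY line, or None if the
    log has no summary (for example because the job died before exit)."""
    matches = SUMMARY_RE.findall(log_text)
    return int(matches[-1]) if matches else None
```

Returning None instead of 0 when the summary is missing keeps a crashed job from being mistaken for a clean one.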
>> > >>
>> > >>Suspecting overheating I monitored the card's temperature with
>> > >>nvidia-smi while on full load, and it gets up to 83C with 70% fan
>> > >>which I don't think is out of tolerance.
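
Temperature, fan speed and clock can also be logged from a script instead of watched by hand. The sketch below assumes the installed nvidia-smi supports the --query-gpu and --format=csv options and that the card reports all three fields as plain numbers (some cards print "[Not Supported]" for fan speed, which this parser does not handle). The function names and the limit_c threshold are hypothetical; no vendor limit is implied.

```python
import subprocess

QUERY_FIELDS = "temperature.gpu,fan.speed,clocks.sm"


def parse_query_line(line):
    """Parse one CSV line such as '83, 70, 1084' into a dict of ints."""
    temp, fan, clock = (field.strip() for field in line.split(","))
    return {"temp_c": int(temp), "fan_pct": int(fan), "sm_clock_mhz": int(clock)}


def read_gpu_status(gpu_index=0):
    """Query nvidia-smi once; needs an NVIDIA driver on the machine."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_query_line(out.strip().splitlines()[0])


def over_limit(status, limit_c):
    """True if the reported temperature exceeds the caller's threshold."""
    return status["temp_c"] > limit_c
```

Calling read_gpu_status() in a loop with time.sleep between samples while pmemd.cuda is running gives a simple log to line up against the step at which a crash occurs.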
>> > >>
>> > >>When I run tests with make test.cuda I get 89 passes and 8 fails, and
>> > >>on inspection of the .diff file they are just errors in the last
>> > >>decimal place. Similarly the benchmark suite passes fine and gives
>> > >>ns/day in line with gpu benchmarks on ambermd.org.
>> > >>
>> > >>For completeness' sake I've also recompiled Amber with -cuda_SPDP,
>> > >>-cuda_DPD and the intel compilers and I get the same story.
>> > >>
>> > >>I'm running CentOS 6.5, and CUDA 6.0/5.5, with nvidia driver version
>> > >>v337.12. I installed these by hand as sadly I can't yum install
>> > >>directly from the cuda and elrepo repositories, as the cuda repo
>> > >>provides v319.37 which the Amber website says is incompatible. I
>> > >>previously tried v331.49 but now that ambermd.org/gpu says GTX780Ti
>> > >>is supported I updated the drivers, and still get the issue.
>> > >>
>> > >>My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P)
>> > >>seems to only support PCI-E 1.x. Could this be the cause of the
>> > >>issue?
>> > >>
>> > >>I'm now at a bit of a loss as to what the issue is. Is there
>> > >>something I am missing, or is this just a bad card?
>> > >>
>> > >>Gratefully,
>> > >>
>> > >>Keiran Rowell
>> > >>
>> > >>_______________________________________________
>> > >>AMBER mailing list
>> > >>AMBER.ambermd.org
>> > >>http://lists.ambermd.org/mailman/listinfo/amber
>> > >
>> > >
>> > >



Received on Thu Jun 12 2014 - 19:30:03 PDT