Dear all,
We've upgraded to a Z87 chipset motherboard, and the 780Ti now sits in a PCI-E 3.0 x16 slot, but I still encounter the same errors. However, "Coolbits" "12" now seems to be working, so I can change the clock speed in nvidia-settings.
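For reference, this is roughly what the Device section of our xorg.conf now contains (the Identifier is just whatever your existing config uses):

    Section "Device"
        Identifier "Device0"
        Driver     "nvidia"
        Option     "Coolbits" "12"
    EndSection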
Changing the clock by an offset of -105 MHz (the minimum allowed in the GUI) gave a Graphics Clock of 1033 MHz. When running the GPU_Validation_Test at this speed, 17/20 runs got to completion, but none of them produced the same final value, and no memcpy error was reported. Each crash reported "ERROR: Calculation halted. Periodic box dimensions have changed too much from their initial values."
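Incidentally, the same offset can also be applied from the command line rather than the GUI, with something like the following (the performance-level index in the brackets may differ between cards and drivers):

    nvidia-settings --assign "[gpu:0]/GPUGraphicsClockOffset[3]=-105"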
I found here (http://wiki.etechnik-rieke.de/index.php/NVidia_PowerMizer) how to force it onto the lowest performance level, which nvidia-settings reports as 549 MHz. Running with this I don't get any crashes; however, only 7/20 of the printed values agree with each other, indicating the arithmetic is still inconsistent.
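For anyone else trying this, the xorg.conf option for pinning PowerMizer to a fixed (lowest) performance level looks something like the following; I'm reproducing the commonly suggested values, so double-check them against that wiki page:

    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x3; PowerMizerDefaultAC=0x3"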
I've found another thread where they have issues with the exact same model (www.gpugrid.net/forum_thread.php?id=3584), and they solved their problem by reducing the memory transfer rate to 2700 MHz (the default is 7000 MHz). However, I can't seem to lower the memory transfer rate through either the GUI or the command line.
"nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[x]=-y"
gives the error:
ERROR: Error assigning value -10 to attribute 'GPUMemoryTransferRateOffset'
(Ashley:0[gpu:0]) as specified in assignment
'[gpu:0]/GPUMemoryTransferRateOffset[3]=-10' (Unknown Error).
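In case it helps anyone hitting the same wall, the available performance levels and the current value of the offset attribute can at least be queried with something like:

    nvidia-settings -q "[gpu:0]/GPUPerfModes" -t
    nvidia-settings -q "[gpu:0]/GPUMemoryTransferRateOffset[3]"

but the assignment itself still fails for me with the error above.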
I've found someone else with the same issue on Linux: https://foldingforum.org/viewtopic.php?f=38&t=26207&sid=bf35e17ee5e122071a2112028565d4e0&start=30
So underclocking appears to have made the card more stable, but it still appears unreliable. I think I should contact Gigabyte directly and see if they will authorise an exchange for a reliable card.
I'm not sure whether any of this makes the problem much clearer.
Thanks everyone for your help,
Keiran
________________________________________
From: Scott Le Grand [varelse2005.gmail.com]
Sent: 03 June 2014 15:45
To: AMBER Mailing List
Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy GpuBuffer::Download failed unspecified launch failure."
So the 780 TI has 2880 cores just like Titan Black.
This explains why when they work they keep right up with it, and beat it
with overclocking.
But they are definitely flaky (unlike GTX 780). I am curious how stable
they'd be at Titan Black clocks (900-980 MHz)...
But otherwise, I'd just buy Titan Blacks.
On Mon, Jun 2, 2014 at 6:34 PM, Keiran Rowell <k.rowell.unsw.edu.au> wrote:
> CUDA's deviceQuery says 1084 MHz.
>
> I would try downclocking it, but adding "Coolbits" "8" (or "12") to my
> xorg.conf doesn't seem to work: the performance menu becomes enabled for
> the 8600GT that is also in the box, but not for the 780Ti.
>
> This is the model for those interested:
>
> http://www.gigabyte.com.au/products/product-page.aspx?pid=4839#ov
>
> Cheers,
>
> Keiran
> ________________________________________
> From: Scott Le Grand [varelse2005.gmail.com]
> Sent: 03 June 2014 10:05
> To: AMBER Mailing List
> Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
> GpuBuffer::Download failed unspecified launch failure."
>
> Just say no to GTX 780TI...
>
>
> What's this one clocked at?
>
>
>
> On Mon, Jun 2, 2014 at 4:54 PM, Keiran Rowell <k.rowell.unsw.edu.au>
> wrote:
>
> > Hi Ross,
> >
> > Tried both switching to CUDA 5.0 and upgrading to NVIDIA beta driver
> > 337.19, and no luck.
> >
> > I ran the GPU validation test and, in short, it didn't manage to get to
> > 500,000 steps on most runs; a cudaMemcpy error occurred before then. On the
> > one time it did manage to grep a final value I got:
> > Etot = -58258.6338  EKtot = 14429.1406  EPtot = -72687.7744
> >
> > Which is a fair bit off the value given in the README file.
> >
> > On another run I also sometimes get "ERROR: Calculation halted. Periodic
> > box dimensions have changed too much from their initial values."
> >
> > Anyway here's a dropbox link if you wanted to check the output:
> >
> > https://www.dropbox.com/sh/m81nwfqsigbncq0/AABdY9XaW51hwRbkDnrD2GhPa
> >
> > We were planning to get a new motherboard anyway, so I'll upgrade and let
> > you know how that goes. I'll also get in touch with the supplier about
> > replacing for a card which is more reliable at stock speeds.
> >
> > Thank you,
> >
> > Keiran
> >
> > ________________________________________
> > From: Ross Walker [ross.rosswalker.co.uk]
> > Sent: 31 May 2014 01:42
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
> > GpuBuffer::Download failed unspecified launch failure."
> >
> > Hi Keiran,
> >
> > In the absence of a hardware issue, errors like 'Invalid write' do indeed
> > imply a software error. However, if the hardware is faulty then the instruction
> > stack may be getting corrupted, in which case all bets are off. It's hard
> > to know what is going on here, but I still suspect a bad GPU. The 780Ti's
> > have been awful in terms of reliability. I think I'll update the AMBER
> > website to make it clear that these are not recommended. If you look at
> > the performance, they come in faster than even the GTX-Titan-Black and yet
> > are lower-binned silicon. Crazy overclocking. :-( A few things I would try
> > just to see if they help:
> >
> > Upgrade to the very latest driver - it is 337.19 I believe. Switch to CUDA
> > 5.0 (this is what we recommend and test with), update your motherboard
> > BIOS, do a clean power-off reboot, and then see what happens. Try the GPU
> > Validation suite I built here:
> > https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
> >
> >
> > All the best
> > Ross
> >
> > On 5/29/14, 10:43 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
> >
> > >Hi Ross and Filip,
> > >
> > >Apologies for not thanking you for your help earlier; we sent the card in
> > >for warranty and I wanted to wait for results before reviving the thread,
> > >and the process ended up taking a while.
> > >
> > >Unfortunately I didn't manage to get Coolbits working for our old card,
> > >so I can't tell you what downclocking would have done. They approved the
> > >replacement, and we requested a standard 780, but they just did a direct
> > >exchange with the same model.
> > >
> > >The downside is that the new GTX780Ti is still having errors; on the plus
> > >side, I'm getting more informative error reports.
> > >
> > >My systems (which run fine on our Tesla C2050s) are still crashing almost
> > >immediately. However, the tutorial B1 explicit water simulation ran for
> > >10 ns without any errors reported at all. This didn't work before, so...
> > >progress, but I'm not sure if that's down to the hardware or the updated Amber.
> > >
> > >The error which is being reported to stdout for my systems is
> "cudaMemcpy
> > >GpuBuffer::Download failed an illegal memory access was encountered" or
> > >"cudaMemcpy GpuBuffer::Download failed unspecified launch failure"
> > >
> > >So I submitted jobs with cuda-memcheck and get "Invalid __global__ ...
> > >Address xxxx is out of bounds" type errors, so memory access errors of
> > >the form listed in section 3.4 here:
> > >http://docs.nvidia.com/cuda/cuda-memcheck/#axzz32zciX8kj
> > >
> > >My computer science is rusty, but does this indicate a software rather
> > >than a hardware issue? Have I got something wrong in my set-up? I've pasted
> > >excerpts of a cuda-memcheck error at the bottom of this message, as well
> > >as attaching the nohup.out
> > >
> > >I've updated and am now running AmberTools v14.02, Amber v14.00, Nvidia
> > >driver v337.12 (beta), CUDA v6.0.1. pmemd.cuda was compiled with
> > >./configure -cuda gnu. OS is CentOS 6.5
> > >
> > >Thank you heaps!
> > >
> > >Keiran
> > >
> > >________________________________________
> > >From: filip fratev [filipfratev.yahoo.com]
> > >Sent: 18 April 2014 02:14
> > >To: AMBER Mailing List
> > >Subject: Re: [AMBER] GTX780Ti consistent error "cudaMemcpy
> > >GpuBuffer::Download failed unspecified launch failure."
> > >
> > >Hi Keiran and Ross,
> > >>I've seen lots of issues with 780Ti since they appear to be right on
> > >>the edge of stability with regards to clock speed.
> > >
> > >In fact this was exactly my case, but mine was the SC version. You might
> > >also be able to confirm that if you use the 337 driver and downclock your
> > >GPU. BTW, 337 is the first driver with this option available.
> > >
> > >
> > >Regards,
> > >Filip
> > >
> > >On Wednesday, April 16, 2014 8:45 PM, Ross Walker <
> ross.rosswalker.co.uk>
> > >wrote:
> > >
> > >Hi Keiran,
> > >
> > >If driver 337.12 does not fix things then it is almost certainly a bad
> > >GPU. I've seen lots of issues with 780Ti since they appear to be right
> on
> > >the edge of stability with regards to clock speed. I recommend returning
> > >it for exchange.
> > >
> > >All the best
> > >Ross
> > >
> > >
> > >On 4/15/14, 11:14 PM, "Keiran Rowell" <k.rowell.unsw.edu.au> wrote:
> > >
> > >>Dear Amber users,
> > >>
> > >>My research group recently bought a GTX780Ti (model number: GV-N78TOC-3GD)
> > >>for running pmemd.cuda, and it's been consistently printing out "NaN" in
> > >>the TEMP field of an mdcrd.out and reporting "cudaMemcpy
> > >>GpuBuffer::Download failed unspecified launch failure." It will run
> for a
> > >>little bit, often a couple thousand frames, which look normal when
> > >>visualised in VMD, before suddenly cutting out.
> > >>
> > >>I've managed to find a few threads on this issue, however they seemed
> > >>mostly to be caused by out-of-date Amber builds, and the only one I saw
> > >>about a GTX780Ti (http://archive.ambermd.org/201401/0378.html) seemed
> to
> > >>be just due to a bad card.
> > >>
> > >>I've made sure my Amber build is up-to-date. "update_amber --update"
> says
> > >>no new updates and "update_amber -v" gives AmberTools version 13.24,
> > >>Amber version 12.21
> > >>
> > >>This exact same system (and a bunch of analogous ones) has run without
> > >>problems on the pair of Tesla C2050's we have. I also tried running the
> > >>longer ambermd.org tutorial B1 runs with pmemd.cuda and I get the same
> > >>types of crashes.
> > >>
> > >>To my inexperienced eyes the "cudaMemcpy buffer" error seemed to be a
> > >>memory error with the card, so I ran the jobs with cuda-memcheck. However,
> > >>every time I do this the job runs fine (albeit very slowly) and with 0
> > >>errors in the summary. When it moves on to the next job without memcheck,
> > >>I then get a crash.
> > >>
> > >>Suspecting overheating, I monitored the card's temperature with nvidia-smi
> > >>while on full load, and it gets up to 83°C with 70% fan, which I don't
> > >>think is out of tolerance.
> > >>
> > >>When I run tests with make test.cuda I get 89 passes and 8 fails, and
> on
> > >>inspection of the .diff file they are just errors in the last decimal
> > >>place. Similarly the benchmark suite passes fine and gives ns/day in
> line
> > >>with gpu benchmarks on ambermd.org.
> > >>
> > >>For completeness' sake I've also recompiled Amber with -cuda_SPDP,
> > >>-cuda_DPDP and the Intel compilers, and I get the same story.
> > >>
> > >>I'm running CentOS 6.5, and CUDA 6.0/5.5, with nvidia driver version
> > >>v337.12. I installed these by hand as sadly I can't yum install
> directly
> > >>from the cuda and elrepo repositories, as the cuda repo provides
> v319.37
> > >>which the Amber website says is incompatible. I previously tried
> v331.49
> > >>but now that ambermd.org/gpu says GTX780Ti is supported I updated the
> > >>drivers, and still get the issue.
> > >>
> > >>My supervisor also mentioned that the motherboard (Gigabyte P35-DS3P)
> > >>seems to only support PCI-E 1.x. Could this be the cause of the issue?
> > >>
> > >>I'm now a bit at a loss as to what the issue is. Is there something I am
> > >>missing, or is this just a bad card?
> > >>
> > >>Gratefully,
> > >>
> > >>Keiran Rowell
> > >>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jun 12 2014 - 19:00:02 PDT