Re: [AMBER] Nvidia driver bug for GTX400 series gpu's/update 1

From: Sergio R Aragon <aragons.sfsu.edu>
Date: Mon, 6 Sep 2010 04:34:43 +0000

Hello Ross,

Thanks for your thoughtful suggestions. I built this CUDA machine just a month ago, so I am using the latest CUDA SDK, 3.1, published in June 2010. The NVIDIA driver for my card is version 256.35 (NVIDIA-Linux-x86_64-256.35). I just found a driver update, version 256.53, which I will install.
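
For what it's worth, a quick way to confirm which driver and runtime a CUDA build actually sees, rather than what the package names claim, is a small query program. This is just an illustrative sketch using the standard runtime API calls, not anything from the Amber tree:

    /* version_check.cu - print the CUDA driver and runtime versions.
       Illustrative sketch only; compile with: nvcc version_check.cu -o version_check */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int driverVersion = 0, runtimeVersion = 0;
        cudaDriverGetVersion(&driverVersion);    /* e.g. 3010 means CUDA 3.1 */
        cudaRuntimeGetVersion(&runtimeVersion);
        printf("CUDA driver API version:  %d\n", driverVersion);
        printf("CUDA runtime API version: %d\n", runtimeVersion);
        return 0;
    }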

As for Amber 11, I compiled with the Master Bugfix File containing fixes 1-6, and likewise for AmberTools 1.4, fixes 1-6. I will look for any updated fixes and start all over again. It would also be good to try the new binary you offered. BTW, the jobs that show the error run fine on standard multiprocessor machines with Amber 10 - they just take quite a bit longer to run.

I am willing to provide my Amber 11 input files if any of you wishes to try to reproduce the error on a 400 series card. Recall that Sasha Buzko has shown the error to appear on 8 different GTX 480 cards. Mine is the ninth 400 series card to show the problem - or possibly the tenth, because there was an earlier report of such a launch error on this mailing list that did not generate much correspondence. The files are too big to send via email, but an ftp transfer can be arranged if there is interest.

A second protein, with 69,000 atoms, just failed with the same launch timed out error after 1.7 ns of NVT production. The system is 1cts, a dimeric protein which has already accumulated nearly 20 ns on my regular 16-processor machine running Amber 10. I am going to try running the failed simulations under the DPDP cuda model now. The other protein that has failed is 1faj, inorganic pyrophosphatase, a multimeric protein with six subunits that are not covalently bonded to each other.

I will report again after I have made all the necessary updates.
Thanks, Sergio





-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Saturday, September 04, 2010 10:36 AM
To: 'AMBER Mailing List'
Subject: Re: [AMBER] Nvidia driver bug for GTX400 series gpu's

Hi Sergio,

Further to Scott's emails let me add my 3c here. I am also not convinced
this is a driver issue. My bet is on a memory leak in the code somewhere or
some other subtle bug. The issue right now is that, with so many other
pressing deadlines and issues in the way, it is difficult to find the time
to sit down and work out exactly what is going on here. If you would like
to lend a hand debugging it, I would be very grateful. Firstly, the thing I
find VERY strange is that it always happens in the same routine. This is
very suspect. Clearly something is going on prior to the call of this
routine, and it will be critical to know exactly what happened immediately
before the crash. The fact that it takes a long time to crash and appears
to be largely random makes debugging a complete nightmare.
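
If you do want to dig in, one generic trick that helps localize this kind of
failure is to wrap every kernel launch in an explicit error check plus a
synchronize, so the error is reported at the launch that actually caused it
rather than at some later API call. A minimal sketch of the idea (this is
generic CUDA debugging code, not something taken from pmemd.cuda):

    /* launch_check.h - generic launch-checking macro for debugging builds.
       The synchronize serializes the GPU, so use it only while hunting the bug. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHECK_LAUNCH(msg)                                              \
        do {                                                               \
            cudaError_t err = cudaGetLastError();                          \
            if (err == cudaSuccess) err = cudaThreadSynchronize();         \
            if (err != cudaSuccess) {                                      \
                fprintf(stderr, "%s failed: %s\n", (msg),                  \
                        cudaGetErrorString(err));                          \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    /* usage, immediately after any kernel launch:
         someKernel<<<grid, block>>>(args);
         CHECK_LAUNCH("someKernel");                                       */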

I think right now all we can hope is that this has been 'accidentally'
fixed by some other update of the code.

So, a few requests.

1) Can you please update your machine to make sure it is running the very
latest driver, the very latest NVIDIA toolkit, and the latest BIOS for both
your card and your motherboard. Then try recompiling AMBER 11 from scratch
with all the bugfixes applied and see if the problem still occurs.

2) Off list I will send you an updated binary to try. This contains a number
of fixes that are in the development tree right now and are scheduled for
release with the upcoming parallel version of the code. Maybe one of these
has accidentally fixed the problem. This will need the latest drivers and
compilers, though, so please make sure your machine has these installed.

All the best
Ross

> -----Original Message-----
> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> Sent: Friday, September 03, 2010 7:35 PM
> To: AMBER Mailing List
> Subject: [AMBER] Nvidia driver bug for GTX400 series gpu's
>
> Dear Scott Legrand and other interested users,
>
> From the data below, I have concluded that there is probably a bug in
> the Nvidia driver as used by the GTX 400 series GPUs.
>
> The following error message is produced by GTX 400 series cards running
> pmemd.cuda:
>
> Error: the launch timed out and was terminated launching kernel
> kPMEGetGridWeights
>
> Number of cards tested:
> 8 GTX 480, made by ASUS (data published here by Sasha Buzko)
> 1 GTX 470, made by MSI (reported by Sergio Aragon)
>
> Hypothesis tests:
>
> 1. The power supplies in use are too small.
> Nvidia recommends a 550 W supply for the 470 card; Aragon uses 650 W.
> Nvidia recommends a 600 W supply for the 480 card; Buzko uses 2 kW.
> Thus it is not a power supply issue.
>
> 2. The card temperature spikes or is too high => cooling is inadequate.
> The following measurements are made on the GTX 470 card using the
> nvidia-settings tool in graphical mode.
>
> Idle: 43 C GPU; 35 C GPU board; fan 40%; Performance level = 0, 50 MHz
>
> Running pmemd.cuda:
> 36,400 atoms: 85 C GPU; 57 C GPU board; fan 56%; Performance level = 4, 607 MHz
> 63,000 atoms: 87 C GPU; 59 C GPU board; fan 61%; Performance level = 3, 607 MHz
>
> Cards were not overclocked and were used as provided by the manufacturers.
> Nvidia's maximum temperature specification is 105 C.
>
> The Anandtech review of these GPUs (attached) shows gaming environments
> running these cards at 93 C GPU temperature without any noticeable ill
> effects. In that application, CUDA is not involved.
>
> These data show that the card is being operated at acceptable
> temperatures and that cooling is adequate.
>
> 3. The error resides in the Amber 11 CUDA port.
>
> The error does not appear on GTX 200 series cards (240 at SFSU, 295 Ross
> Walker?) nor on Teslas (C1060 Genentech, C2050 Ross Walker) running
> pmemd.cuda. Thus the Amber port itself cannot be faulted.
>
> 4. The error appears because the system ran out of memory
>
> Ross Walker estimates that a GTX 480 card should handle 95,000 atoms,
> which implies that the 470 card, with less on-board memory, should handle
> about 75,000. The fact that the failures appear well below these numbers
> (even 20,000 atoms on all eight 480 cards), and that the computation runs
> for many hours before crashing (including constant temperature and
> constant pressure ensembles, in production mode), shows that the problem
> is not one of running out of GPU memory. (Nor of CPU memory either; the
> 470 system has 8 GB of RAM and only about a third is in use.)
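>
> One way to check this hypothesis directly would be to poll the free device
> memory while a job is running; the small sketch below is illustrative only
> and simply wraps the standard cudaMemGetInfo runtime call:
>
>     /* memcheck.cu - print free and total device memory (illustrative sketch).
>        Compile with: nvcc memcheck.cu -o memcheck */
>     #include <stdio.h>
>     #include <cuda_runtime.h>
>
>     int main(void)
>     {
>         size_t free_bytes = 0, total_bytes = 0;
>         cudaMemGetInfo(&free_bytes, &total_bytes);
>         printf("GPU memory: %lu MB free of %lu MB total\n",
>                (unsigned long)(free_bytes >> 20),
>                (unsigned long)(total_bytes >> 20));
>         return 0;
>     }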
>
> 5. The error appears because the X server was competing with the cuda
> job.
>
> The error appears whether the X server is running or not. In my tests I
> ran without X, and I also ran with the card driving a display by first
> starting the nvidia-settings tool via a console login and then starting
> the cuda job. The behavior was the same in both cases. After about 1 ns
> of production (post energy minimization and constrained MD), regardless
> of ensemble, the error appears on the 470 card at 63,000 atoms, but never
> at 36,400 atoms. There appears to be a system-size dependence on the 470
> card, but on the 480 cards the error appears even for 20,000 atoms. The
> 480 cards are run at init level 3, where the X server cannot possibly
> start. We conclude that X server interference is not the cause of the
> error.
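>
> For reference, the "launch timed out" message corresponds to the runtime's
> launch-timeout error, which is normally tied to the driver's watchdog on a
> display-attached GPU. Whether the watchdog is actually armed on a given
> card can be checked with a short query; this is an illustrative sketch
> using the standard cudaGetDeviceProperties call, not Amber code:
>
>     /* watchdog_check.cu - report whether the kernel run-time watchdog is
>        enabled on each device (illustrative sketch). */
>     #include <stdio.h>
>     #include <cuda_runtime.h>
>
>     int main(void)
>     {
>         int n = 0;
>         cudaGetDeviceCount(&n);
>         for (int i = 0; i < n; ++i) {
>             cudaDeviceProp prop;
>             cudaGetDeviceProperties(&prop, i);
>             printf("Device %d (%s): kernelExecTimeoutEnabled = %d\n",
>                    i, prop.name, prop.kernelExecTimeoutEnabled);
>         }
>         return 0;
>     }
>
> On a headless card running at init level 3, this flag would be expected to
> read 0.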
>
> There are very few things left to blame, and the salient one is the
> driver. The CUDA 3.1 driver treats the 200 series, 400 series, and Tesla
> cards differently. The hardware in the 470 card is the exact same Fermi
> chip as in the C2050, but its performance is turned down by the driver.
> It does not appear that the 200 series cards receive this special
> treatment. The clock speed on the C2050 is 1.55 GHz, compared to 1.215
> GHz for the 470. This undoubtedly reflects the cooling capacity of the
> implementation. The temperature data above show that the 470 card is
> being used well within specifications and is being cooled adequately.
> The 480 card has different hardware, with 480 processors, and may be
> treated differently by the driver compared to the 470 card (448
> processors), perhaps explaining why the 480 card fails even for small
> systems. ASUS is an excellent hardware manufacturer, and the fact that
> all 8 of its cards showed this behavior again points to the driver, not
> to the hardware or other software.
>
> I think it is time for Nvidia to take a hard look at what is going on
> here. I understand that Nvidia wants to sell Teslas and avoid consumer
> card competition in the HPC market. There is enough differentiation
> because of the memory capacities of the cards, however. Those of us
> who purchase this hardware deserve a bug free product.
>
> Thanks,
>
> Sergio Aragon
> Professor of Chemistry
> San Francisco State University
>
>
>






_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Sep 05 2010 - 22:00:03 PDT