Re: [AMBER] Nvidia driver bug for GTX400 series gpu's from Scott Le Grand on 2010-09-04 (Amber Archive Sep 2010)

From: Scott Le Grand <SLeGrand.nvidia.com>
Date: Sat, 4 Sep 2010 09:57:11 -0700

BTW I'm not claiming this is the case in any way but...

What toolkit/driver are you using? If you're using 195.xx and 3.0, all GTX4xx chips are hosed with the current release. And it would exhibit exactly this sort of behavior (working on GTX2xx/C1060) because there was a compiler bug that introduced a race condition for SM2.x code. There was a workaround for this in earlier code but since that bug was fixed for 3.1, said workaround was commented out.
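
To make the check concrete, here is a rough Python sketch of the condition described above. The function name and the example version strings are hypothetical, and the rule encodes only what is stated in this message (a 195.xx driver plus toolkit 3.0 on SM2.x hardware); it is not an official compatibility test.

```python
def hits_sm2x_race(driver_version, toolkit_version, sm_major):
    """Return True for the combination described above: a 195.xx driver,
    CUDA toolkit 3.0, and an SM2.x (GTX4xx-class) chip."""
    driver_major = int(driver_version.split(".")[0])
    toolkit = tuple(int(part) for part in toolkit_version.split(".")[:2])
    return sm_major == 2 and driver_major == 195 and toolkit == (3, 0)


# Illustrative inputs only; the version strings are made up for the example.
print(hits_sm2x_race("195.36", "3.0", 2))  # SM2.x with the 195.xx/3.0 pair
print(hits_sm2x_race("195.36", "3.0", 1))  # SM1.x (GTX2xx/C1060 class)
print(hits_sm2x_race("256.40", "3.1", 2))  # toolkit 3.1, where the bug was fixed
```

Only the first call matches the affected combination, which would be consistent with the same code working on GTX2xx/C1060 hardware.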

-----Original Message-----
From: Sergio R Aragon [mailto:aragons.sfsu.edu]
Sent: Friday, September 03, 2010 19:35
To: AMBER Mailing List
Subject: [AMBER] Nvidia driver bug for GTX400 series gpu's

Dear Scott Legrand and other interested users,

From the data below, I have concluded that there is probably a bug in the Nvidia driver as used by the GTX400 series GPU's.

The following error message is produced by GTX 400 series cards running pmemd.cuda:

Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights

Number of cards tested:
8 GTX 480, made by ASUS (data published here by Sasha Buzko)
1 GTX 470, made by MSI (reported by Sergio Aragon)

Hypothesis tests:

1. The power supplies in use are too small.
        Nvidia recommends 550 W supply for the 470 card; Aragon uses 650 W.
        Nvidia recommends 600 W supply for the 480 card; Buzko uses 2 kW
        Thus it is not a power supply issue.

2. The card temperature spikes or is too high => cooling is inadequate
        The following measurements are made on the GTX 470 card using the nvidia-settings tool in graphical mode.

Idle: 43 C, gpu; 35 C gpu board; fan 40%, Performance level = 0, 50 MHz

Running pmemd.cuda
36,400 atoms: 85 C, gpu; 57 C gpu board; fan 56%, Performance level = 4, 607 MHz
63,000 atoms: 87 C, gpu; 59 C gpu board; fan 61%, Performance level = 3, 607 MHz

The cards were not overclocked and were used as provided by the manufacturers.
Nvidia maximum temperature = 105 C.

The AnandTech review (attached) shows gaming environments running these cards at a GPU temperature of 93 C without any noticeable effects. CUDA is not involved in that application.

This data shows that the card is being operated at acceptable temperatures and that cooling is adequate.
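
The temperature comparison above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the readings are the ones quoted above, and the `nvidia-smi --query-gpu` call is an assumption about a newer driver than the one discussed here (the readings above were taken with nvidia-settings).

```python
import subprocess

NVIDIA_MAX_TEMP_C = 105  # maximum GPU temperature quoted above


def headroom_c(gpu_temp_c, limit_c=NVIDIA_MAX_TEMP_C):
    """Degrees C remaining before the quoted maximum is reached."""
    return limit_c - gpu_temp_c


def parse_temps(csv_text):
    """Parse one integer temperature per line (one line per GPU)."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def read_gpu_temps():
    """Query current GPU temperatures. Assumes a driver whose nvidia-smi
    supports --query-gpu; not called below, so the sketch runs without a GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temps(out)


# Readings reported above for the GTX 470 (atom count -> GPU temperature in C).
readings = {36400: 85, 63000: 87}
margins = {atoms: headroom_c(temp) for atoms, temp in readings.items()}
for atoms, margin in margins.items():
    print(f"{atoms} atoms: {margin} C below the {NVIDIA_MAX_TEMP_C} C limit")
```

Logging `read_gpu_temps()` at intervals during a long run would also show whether a spike precedes the launch timeout, which a single reading cannot rule out.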

3. The error resides in the Amber 11 Cuda port.

The error does not appear on GTX200 series cards (240 at SFSU, 295 Ross Walker?) or on Tesla cards (C1060 at Genentech, C2050 Ross Walker) running pmemd.cuda. Thus the Amber port cannot be faulted.

4. The error appears because the system ran out of memory

Ross Walker estimates that a GTX480 card should handle 95,000 atoms, which implies that the 470 card should handle 75,000. The fact that the failure appears well below these numbers (even at 20,000 atoms on all eight 480 cards), and that the computation runs for many hours before crashing (including constant temperature and constant pressure ensembles, in production mode), shows that the problem is not one of running out of GPU memory. Nor is host memory the issue: the 470 system has 8 GB of RAM and only about a third is in use.

5. The error appears because the X server was competing with the cuda job.

The error appears whether the X server is running or not. In my tests, I ran without X, and also with X by first starting the nvidia-settings tool via a console login that uses the card for display and then starting the cuda job. The behavior was the same in both cases. After about 1 ns of production (post energy minimization and constrained MD), regardless of ensemble, the error appears on the 470 card at 63,000 atoms, but never at 36,400 atoms. There appears to be a system size dependence on the 470 card, but on the 480 cards the error appears even for 20,000 atoms. The 480 cards are run in init level 3, where the X server can't possibly start. We conclude that X server interference is not the cause of the error.

There are very few things left to blame, and the salient one is the driver. The Cuda driver 3.1 treats all of the cards, 200's, 400's and Teslas, differently. The hardware in the 470 card is the exact same Fermi chip as in the C2050, but its performance is turned down by the driver. It does not appear that the 200 series cards have this special treatment. The clock speed on the C2050 is 1.55 GHz, compared to 1.215 GHz for the 470. This undoubtedly reflects the cooling capacity of the implementation. The temperature data above shows that the 470 card is being used well within specifications and is being cooled adequately. The 480 card has different hardware, with 480 processors, and may be treated differently by the driver compared to the 470 card (448 processors), perhaps explaining why the 480 card fails even for small systems. ASUS is an excellent hardware manufacturer, and the fact that all 8 of its cards showed this behavior again points to the driver, not the hardware or other software.

I think it is time for Nvidia to take a hard look at what is going on here. I understand that Nvidia wants to sell Teslas and avoid consumer card competition in the HPC market. There is enough differentiation because of the memory capacities of the cards, however. Those of us who purchase this hardware deserve a bug-free product.

Thanks,

Sergio Aragon
Professor of Chemistry
San Francisco State University


Received on Sat Sep 04 2010 - 10:00:04 PDT