Re: [AMBER] Nvidia driver bug for GTX400 series gpu's

From: Scott Le Grand <SLeGrand.nvidia.com>
Date: Sat, 4 Sep 2010 10:49:26 -0700

Nothing strange about where it happens whatsoever actually. This event indicates the simulation has achieved lightspeed and headed off to (Not so) Happy NaN Land :-(. Or in plain English, a math error has led to a sudden massive overheating of the system. When this happens, the last routine before kPMEGetGridWeights goes nukular(sic) and you see kPMEGetGridWeights get hit with a launch error.
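
For anyone chasing this: a cheap way to catch the blow-up one step earlier
is to scan the coordinates for non-finite values before the charge-spreading
step. A minimal sketch (generic CUDA, not the actual pmemd.cuda code;
checkFinite is a hypothetical guard kernel):

#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Flags whether any coordinate has gone NaN/Inf. Once one has, casting it
// to a grid index is undefined, and the next kernel launch dies.
__global__ void checkFinite(const double* crd, int n, int* bad)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && !isfinite(crd[i]))
        atomicExch(bad, 1);
}

int main()
{
    const int n = 4;
    double h_crd[n] = { 1.0, 2.0, 0.0, 3.0 };
    h_crd[2] = std::nan("");            // simulate an exploded coordinate

    double* d_crd; int* d_bad; int h_bad = 0;
    cudaMalloc(&d_crd, n * sizeof(double));
    cudaMalloc(&d_bad, sizeof(int));
    cudaMemcpy(d_crd, h_crd, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bad, &h_bad, sizeof(int), cudaMemcpyHostToDevice);

    checkFinite<<<1, 64>>>(d_crd, n, d_bad);
    cudaMemcpy(&h_bad, d_bad, sizeof(int), cudaMemcpyDeviceToHost);
    printf("non-finite coordinate present: %s\n", h_bad ? "yes" : "no");

    cudaFree(d_crd); cudaFree(d_bad);
    return 0;
}

Dropping a check like this between force kernels narrows down which step
first produces the NaN.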

-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Saturday, September 04, 2010 10:36
To: 'AMBER Mailing List'
Subject: Re: [AMBER] Nvidia driver bug for GTX400 series gpu's

Hi Sergio,

Further to Scott's emails, let me add my 3c here. I am also not convinced
this is a driver issue. My bet is on a memory leak in the code somewhere or
some other subtle bug. The issue right now is that, with so many other
pressing deadlines in the way, it is difficult to find the time to sit down
and work out exactly what is going on here. If you would like to lend a
hand debugging, I would be very grateful. Firstly, the thing I find VERY
strange is that it always happens in the same routine. This is very
suspect. Clearly something is going wrong prior to the call of this
routine, and it will be critical to know exactly what happened immediately
before the crash. The fact that it takes a long time to crash and appears
to be largely random makes debugging a complete nightmare.

I think right now all we can hope is that this has been 'accidentally'
fixed in some other update of the code.

So, a few requests.

1) Can you please update your machine to make sure it is running the very
latest driver, the very latest NVIDIA CUDA toolkit, and the very latest
BIOSes for your card and motherboard. Then try recompiling AMBER 11 from
scratch with all the bugfixes applied and see if the problem still occurs.
(A small version-check sketch follows after these requests.)

2) Off list I will send you an updated binary to try. This contains a number
of fixes that are in the development tree right now, and scheduled for
release with the upcoming parallel version of the code. Maybe one of these
has accidentally fixed the problem. This will need the latest drivers and
compilers though so please make sure your machine has these installed.
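
As a quick sanity check on request 1, this small program (generic CUDA
runtime calls, not part of AMBER) reports which driver and toolkit versions
the machine is actually running:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVer = 0, runtimeVer = 0, devCount = 0;
    cudaDriverGetVersion(&driverVer);    // CUDA version the driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // toolkit this binary was built with
    cudaGetDeviceCount(&devCount);
    printf("driver CUDA %d, runtime CUDA %d, %d device(s)\n",
           driverVer, runtimeVer, devCount);

    for (int d = 0; d < devCount; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("  device %d: %s, SM %d.%d\n", d, p.name, p.major, p.minor);
    }
    return 0;
}

Compile with nvcc and run it on the machine that shows the crashes; if the
version numbers disagree with what you think you installed, that is worth
fixing first.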

All the best
Ross

> -----Original Message-----
> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> Sent: Friday, September 03, 2010 7:35 PM
> To: AMBER Mailing List
> Subject: [AMBER] Nvidia driver bug for GTX400 series gpu's
>
> Dear Scott Legrand and other interested users,
>
> From the data below, I have concluded that there is probably a bug in
> the NVIDIA driver as used by the GTX 400-series GPUs.
>
> The following error message is produced by GTX 400 series cards running
> pmemd.cuda:
>
> Error: the launch timed out and was terminated launching kernel
> kPMEGetGridWeights
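>
> For reference, this message is the CUDA runtime surfacing an asynchronous
> failure, so the kernel it names is not necessarily the one at fault. A
> generic check pattern (a sketch only; someKernel is a hypothetical
> stand-in, not the actual pmemd.cuda source) looks like:
>
> #include <cstdio>
> #include <cuda_runtime.h>
>
> __global__ void someKernel() { }   // hypothetical stand-in kernel
>
> int main()
> {
>     someKernel<<<1, 1>>>();
>     cudaError_t err = cudaGetLastError();     // launch-time errors
>     if (err == cudaSuccess)
>         err = cudaThreadSynchronize();        // execution/timeout errors
>     if (err != cudaSuccess)
>         fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
>     return 0;
> }
>
> An error reported while "launching kernel X" can therefore have been
> caused by whatever kernel ran just before X.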
>
> Number of cards tested:
> 8 GTX 480, made by ASUS (data published here by Sasha Buzko)
> 1 GTX 470, made by MSI (reported by Sergio Aragon)
>
> Hypothesis tests:
>
> 1. The power supplies in use are too small.
>    NVIDIA recommends a 550 W supply for the 470 card; Aragon uses 650 W.
>    NVIDIA recommends a 600 W supply for the 480 card; Buzko uses 2 kW.
>    Thus it is not a power supply issue.
>
> 2. The card temperature spikes or is too high => cooling is inadequate
> The following measurements are made on the GTX 470 card using the
> nvidia-settings tool in graphical mode.
>
> Idle:         43 C GPU; 35 C board; fan 40%; performance level 0; 50 MHz
>
> Running pmemd.cuda:
> 36,400 atoms: 85 C GPU; 57 C board; fan 56%; performance level 4; 607 MHz
> 63,000 atoms: 87 C GPU; 59 C board; fan 61%; performance level 3; 607 MHz
>
> Cards were not overclocked and were used as provided by the manufacturers.
> NVIDIA's maximum rated temperature is 105 C.
>
> The AnandTech review (attached) of these GPUs shows gaming workloads
> running the cards at 93 C GPU temperature without any noticeable ill
> effects. In that application, CUDA is not involved.
>
> This data shows that the card is being operated at acceptable
> temperatures and that cooling is adequate.
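>
> As a cross-check on these readings, and assuming the driver package on
> the machine ships the NVML library (an assumption; it is not part of the
> CUDA toolkit proper), a minimal poller like this can log the GPU
> temperature alongside a run, independently of nvidia-settings:
>
> #include <cstdio>
> #include <unistd.h>
> #include <nvml.h>   // link with -lnvidia-ml
>
> int main()
> {
>     nvmlDevice_t dev;
>     unsigned int temp = 0;
>     nvmlInit();
>     nvmlDeviceGetHandleByIndex(0, &dev);
>     for (int i = 0; i < 60; ++i) {   // sample once a second for a minute
>         nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
>         printf("GPU temperature: %u C\n", temp);
>         sleep(1);
>     }
>     nvmlShutdown();
>     return 0;
> }
>
> This makes it easy to see whether a temperature spike immediately
> precedes a crash.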
>
> 3. The error resides in the Amber 11 Cuda port.
>
> The error does not appear on GTX 200-series cards (a 240 at SFSU; a 295,
> Ross Walker?) or on Teslas (a C1060 at Genentech; a C2050, Ross Walker)
> running pmemd.cuda. Thus the AMBER port itself cannot be faulted.
>
> 4. The error appears because the system ran out of memory
>
> Ross Walker estimates that a GTX 480 card should handle 95,000 atoms,
> which implies that the 470 card should handle 75,000. The fact that the
> failure appears well below these sizes (even at 20,000 atoms on all eight
> 480 cards), and that the computation runs for many hours before crashing
> (including constant-temperature and constant-pressure ensembles, in
> production mode), shows that the problem is not one of running out of
> GPU memory. (Nor of CPU-side memory either: the 470 system has 8 GB of
> RAM and only about a third of it is in use.)
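>
> The memory hypothesis can also be tested directly rather than by
> extrapolation. A minimal query (generic CUDA runtime calls, not an AMBER
> tool) run against the same device shows how much GPU memory is actually
> free:
>
> #include <cstdio>
> #include <cuda_runtime.h>
>
> int main()
> {
>     size_t freeB = 0, totalB = 0;
>     cudaSetDevice(0);
>     cudaMemGetInfo(&freeB, &totalB);   // free/total device memory, bytes
>     printf("GPU memory: %.0f MB free of %.0f MB\n",
>            freeB / 1048576.0, totalB / 1048576.0);
>     return 0;
> }
>
> If this reports plenty of free memory shortly before a crash, GPU memory
> exhaustion is ruled out directly.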
>
> 5. The error appears because the X server was competing with the cuda
> job.
>
> The error appears whether the X server is running or not. In my tests I
> ran without X, and separately I first started the nvidia-settings tool
> via a console login that uses the card for display and then started the
> CUDA job. The behavior was the same in both cases. After about 1 ns of
> production (post energy minimization and constrained MD), regardless of
> ensemble, the error appears on the 470 card at 63,000 atoms, but never
> at 36,400 atoms. There appears to be a system-size dependence on the 470
> card, but on the 480 cards the error appears even at 20,000 atoms. The
> 480 cards are run at init level 3, where the X server cannot possibly
> start. We conclude that X-server interference is not the cause of the
> error.
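>
> A direct check is possible here as well: the runtime reports whether the
> display watchdog (what "the launch timed out" refers to) is armed on a
> given device. A minimal query (generic CUDA, not AMBER code):
>
> #include <cstdio>
> #include <cuda_runtime.h>
>
> int main()
> {
>     cudaDeviceProp p;
>     cudaGetDeviceProperties(&p, 0);
>     // Nonzero means the display watchdog can kill long-running kernels.
>     printf("%s: kernelExecTimeoutEnabled = %d\n",
>            p.name, p.kernelExecTimeoutEnabled);
>     return 0;
> }
>
> If this prints 0 on the 480 machines at init level 3 and the "timed out"
> message still appears, watchdog interference is ruled out even more
> firmly.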
>
> There are very few things left to blame, and the salient one is the
> driver. The CUDA 3.1 driver treats the 200-series, 400-series, and
> Tesla cards differently. The hardware in the 470 card is the exact same
> Fermi chip as in the C2050, but its performance is turned down by the
> driver. The 200-series cards do not appear to receive this special
> treatment. The clock speed on the C2050 is 1.55 GHz, compared to 1.215
> GHz for the 470. This undoubtedly reflects the cooling capacity of the
> implementation. The temperature data above show that the 470 card is
> being used well within specification and is being cooled adequately.
> The 480 card has different hardware, with 480 processors, and may be
> treated differently by the driver than the 470 card (448 processors),
> perhaps explaining why the 480 card fails even for small systems. ASUS
> is an excellent hardware manufacturer, and the fact that all 8 of its
> cards showed this behavior again points to the driver, not to the
> hardware or other software.
>
> I think it is time for NVIDIA to take a hard look at what is going on
> here. I understand that NVIDIA wants to sell Teslas and avoid
> competition from consumer cards in the HPC market. There is enough
> differentiation in the memory capacities of the cards, however. Those
> of us who purchase this hardware deserve a bug-free product.
>
> Thanks,
>
> Sergio Aragon
> Professor of Chemistry
> San Francisco State University
>
>
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber