[AMBER] JAC error reproduced 3 times on GTX 470 from Sergio R Aragon on 2010-09-09 (Amber Archive Sep 2010)

From: Sergio R Aragon <aragons.sfsu.edu>
Date: Thu, 9 Sep 2010 16:20:06 +0000

Hello All,

I've run the jac test using Ross Walker's input files 3 times on my GTX 470
card with my original Amber 11 install with Bug fixes 1-6. Here is a summary
of results:

Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights

1. Failed at 120 ps; no X server running. No other jobs running on host.
2. Failed at 370 ps; X server running, nvidia-settings running. GPU temp 83 C
3. Failed at 178 ps; no X server running. No other jobs running on host.

Here is the description of my system again for easy reference:
MSI GTX 470
Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
v256.35
Compiled Amber 11 with ./configure -cuda gnu

My temperature measurements show "moderate" operating temperatures for the card, a few degrees lower than other larger jobs that I've recently run. Does somebody really think this is a temperature issue? Nevertheless, it appears that the nvidia-settings tool does provide a way of down-clocking the card. BTW, the 470 card already runs at lower clocks than the C2050.
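
For anyone who wants to record temperatures over the length of a run rather than reading them off the GUI, here is a minimal Python sketch. It assumes that `nvidia-settings -q GPUCoreTemp -t` prints the core temperature as a bare number on this driver, and that an X server is running, since nvidia-settings needs one. The function names `parse_core_temp`, `read_core_temp` and `log_temps` are made up for illustration and are not part of any Amber or NVIDIA tool.

```python
import re
import subprocess
import time


def parse_core_temp(text):
    """Return the first integer in terse query output, or None if there is none."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


def read_core_temp(query_cmd=("nvidia-settings", "-q", "GPUCoreTemp", "-t")):
    """Run the (assumed) terse temperature query and parse its output."""
    out = subprocess.run(
        list(query_cmd), capture_output=True, text=True, check=True
    ).stdout
    return parse_core_temp(out)


def log_temps(read_fn, samples, interval_s, sleep_fn=time.sleep):
    """Collect (elapsed_seconds, temperature) pairs at a fixed interval."""
    readings = []
    for i in range(samples):
        readings.append((i * interval_s, read_fn()))
        if i + 1 < samples:
            sleep_fn(interval_s)
    return readings


if __name__ == "__main__":
    # Sample every 30 s for 10 minutes while the jac job runs in another shell.
    for elapsed, temp in log_temps(read_core_temp, samples=20, interval_s=30):
        print(f"{elapsed:5d} s  {temp} C")
```

Passing the reader in as `read_fn` keeps the logging loop independent of the query command, so a different tool can be swapped in if this query does not behave as assumed.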

I am going to restart from scratch now, as suggested by Ross: recompile a fresh Amber 11 suite with bugfix.all (patches 1-8), update the NVIDIA driver to 256.53, and rerun the jac tests.

I note that Christophe Deprez has just reported similar non-reproducible numbers and an identical error with his 470 card on the jac run. His card is identical to mine, an MSI GTX 470. It seems we are able to consistently reproduce this error on 400 series cards.

Cheers, Sergio

-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Wednesday, September 08, 2010 8:42 PM
To: 'AMBER Mailing List'
Subject: Re: [AMBER] JAC test on GTX 470 error produced/run again w/o changes/need help

Hi Sergio,

> Because there has been some question of temperatures, I am running the
> X server and the nvidia-settings tool this time. Since the job is
> small I don't expect any interference, but this will allow me to record
> temperatures. I will rerun the job w/o X again to eliminate any
> doubts about X server interference. The temperatures are 83 C gpu, 56
> C board, 55% fan. The job already passed the 120 ps where it crashed

This does not tell you about any specific hot spots in part of the GPU
though so I'm not sure what can be read from this. If you can somehow force
the fan to run at 100% that would be more useful. Or open the case and stick
the biggest fan you possibly can pointing straight at the GPU and see if
that helps. Note, though, that the real test will be whether underclocking
the GPU and memory speeds to match the C2050 fixes the problem.

> Please help: I ran patch with the new bugfix.all for Amber11 to
> include patches 7 and 8 that I did not have. It appears that in patch
> 4, which I had previously applied, the patch did not get skipped and I
> got a message saying:
> Patching file src/pmemd/src/cuda/gpu.cpp
> HUNK #4 Failed at 2072
> HUNK #5 Failed at 2259
> HUNK #6 Failed at 2773 out of 6 hunks failed - saving rejects in ...

My advice at this point would be to delete the whole of your amber11 tree
(or archive it somewhere). Then re-extract a vanilla copy from the
Amber11.tar.bz2 file and the AMBERTools1.4.tar.bz2 files and then apply the
AMBER 11 bugfix.all and AmberTools bugfix.all. Ultimately this will be much
more efficient than trying to debug the above issue, and it will give you
peace of mind that things are patched correctly.

> At any rate, I have not recompiled Amber 11. I know I should have just
> applied patches 7,8 individually, and instead I got lazy and used the
> entire bugfix.all.

No, this 'should' work but relies on how well patch can recognize if
something is already patched. In your case though this suggests to me that
something was 'fishy' with the gpu.cpp file in the first place so
re-extracting everything from the original tar files is probably safer.
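
One way to check a tree before trusting it is to run patch in dry-run mode and list any failed hunks per file. The Python sketch below assumes GNU patch (the `-p0`, `-N`, `--dry-run`, `-d` and `-i` options) and assumes that failure lines look like the ones quoted above; the helper names `failed_hunks` and `dry_run` are hypothetical, and the exact wording of patch output may vary between versions.

```python
import os
import re
import subprocess

# Matched case-insensitively because the quoted output uses mixed case.
PATCHING = re.compile(r"^patching file (\S+)", re.IGNORECASE)
HUNK_FAILED = re.compile(r"^Hunk #(\d+) FAILED at (\d+)", re.IGNORECASE)


def failed_hunks(patch_output):
    """Map each file name to a list of (hunk_number, line_number) failures."""
    current = None
    failures = {}
    for raw in patch_output.splitlines():
        line = raw.strip()
        m = PATCHING.match(line)
        if m:
            current = m.group(1)
            continue
        m = HUNK_FAILED.match(line)
        if m:
            failures.setdefault(current, []).append(
                (int(m.group(1)), int(m.group(2)))
            )
    return failures


def dry_run(patch_file, tree_dir):
    """Run patch without modifying any file and report the failed hunks."""
    cmd = [
        "patch", "-p0", "-N", "--dry-run",
        "-d", tree_dir,                      # directory to patch
        "-i", os.path.abspath(patch_file),   # absolute, since -d changes directory
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return failed_hunks(result.stdout)
```

An empty result from `dry_run("bugfix.all", "amber11")` (both names are placeholders for your own paths) would suggest that the patch applies cleanly or is skipped as already applied. Any entry for gpu.cpp would point back to the problem described above.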

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu Sep 09 2010 - 09:30:09 PDT