[AMBER] debug run

From: Sergio R Aragon <aragons.sfsu.edu>
Date: Fri, 10 Sep 2010 19:03:20 +0000

Hi Scott,

When I attempt your instructions on my machine, I get this output:
(no debugging symbols found)
Don't we need to recompile pmemd.cuda with a debug option or something like that before running this test?
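For what it's worth, my guess at the generic debug flags is below; I don't know where (or whether) Amber's configure exposes these, so the lines are only illustrative, not Amber's actual build options:

        # guesses at generic debug flags -- not Amber's actual configure options
        gfortran -g -O0 ...    # host-side symbols for gdb/cuda-gdb
        nvcc -g -G ...         # device-side symbols and debug info for cuda-gdb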

Sergio


-----Original Message-----
From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
Sent: Friday, September 10, 2010 11:33 AM
To: AMBER Mailing List
Subject: Re: [AMBER] GPU memory test program

Here's one thing you guys can all do:


Take JAC and:

cuda-gdb pmemd.cuda
run -O -i mdin -c inpcrd

(assuming mdin is your input file and inpcrd is the coordinate input file)

I'm doing this now and waiting for it to trip up...
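Once it does trip, I'll grab something like the following from inside cuda-gdb (exact command names may differ slightly on the 3.1-era cuda-gdb):

        (cuda-gdb) backtrace             # host-side call stack at the failure
        (cuda-gdb) info cuda threads     # which device threads were running / faulted
        (cuda-gdb) info cuda kernels     # kernels resident on the GPU at the time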


-----Original Message-----
From: Sergio R Aragon [mailto:aragons.sfsu.edu]
Sent: Friday, September 10, 2010 10:18
To: AMBER Mailing List
Subject: [AMBER] GPU memory test program

Hello Mark,

Thanks for sharing the information on that GPU memory checking tool; it will be useful for ruling out hardware problems when isolated errors occur. The case we have here, with 10 different GTX 480 and 4 GTX 470 cards, points to something serious in either the hardware or the driver for these cards. If it's hardware, it could be a memory issue, so Scott Le Grand may find this tool useful in his further work on the problem. Being at Nvidia, he probably has access to a whole series of tools that we don't know about... but I'm going to try compiling it with your instructions.

Thanks again.

Sergio

-----Original Message-----
From: Mark Williamson [mailto:mjw.mjw.name]
Sent: Friday, September 10, 2010 8:15 AM
To: AMBER Mailing List
Subject: Re: [AMBER] JAC error reproduced 3 times on GTX 470

On 09/09/2010 17:20, Sergio R Aragon wrote:
> Hello All,
>
> I've run the jac test using Ross Walker's input files 3 times on my GTX 470
> card with my original Amber 11 install with Bug fixes 1-6. Here is a summary
> of results:
>
> Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights
>
> 1. Failed at 120 ps; no X server running. No other jobs running on host.
> 2. Failed at 370 ps; X server running, nvidia-settings running. GPU temp 83 C
> 3. Failed at 178 ps; no X server running. No other jobs running on host.
>
> Here is the description of my system again for easy reference:
> MSI GTX 470
> Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
> Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.35
> Compiled Amber 11 with ./configure -cuda gnu
>
> My temperature measurements show "moderate" operating temperatures for the card, a few degrees lower than other larger jobs that I've recently run. Does somebody really think this is a temperature issue? Nevertheless, it appears that the nvidia-settings tool does provide a way of down-clocking the card. BTW, the 470 card already runs at lower clocks than the C2050.
>
> I am going to restart from scratch now, as suggested by Ross: recompile a fresh Amber 11 suite with bugfix.all (patches 1-8) applied, update the NVIDIA driver to 256.53, and re-run the jac tests.
>
> I note that Christophe Deprez has just reported similar non-reproducible numbers and an identical error with his 470 card on the jac run. His card is identical to mine, an MSI GTX 470. It seems we are able to reproduce this error consistently on 400-series cards.
>
> Cheers, Sergio

Hi all,

Just a quick thought whilst reading this thread. In the past on standard
PCs, when encountering random crashes with code that was known to
function fine on other machines, I would always attempt to rule out
certain aspects of the hardware. I'd usually start with the memory by
using a software memory tester such as memtest86+, and running it overnight.

Imran S. Haque (from Vijay Pande's group, I think) has written a useful
analogous tool for GPUs. It can be found at:

http://www.cs.stanford.edu/people/ihaque/papers/gpuser.pdf
https://simtk.org/project/xml/downloads.xml?group_id=385
( b61149bae88bb5398877b8c00d428bfc memtestG80-1.1-src.tar.gz )
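To check the download against that checksum before building, something like this should do (filename taken from the link above):

        md5sum memtestG80-1.1-src.tar.gz
        # expect: b61149bae88bb5398877b8c00d428bfc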

I was able to eventually compile this on a RHEL 4.8 x86_64 box; however,
I needed to modify the "Makefile.linux64" makefile, setting the
following variable:

  POPT_DIR=/usr/lib64/

to avoid an "undefined reference to `__stack_chk_fail'" linking error.

Then, the tool can finally be compiled with
        make -f Makefile.linux64

and then run as follows:

[07:57][bunny:1.11][mjw:memtestG80-1.1]$ ./memtestG80
      ------------------------------------------------------------
      |                     MemtestG80 v1.00                     |
      |                                                          |
      | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
      |                                                          |
      |      Defaults: GPU 0, 128MB RAM, 50 test iterations      |
      |  Amount of tested RAM will be rounded up to nearest 2MB  |
      ------------------------------------------------------------

       Available flags:
         --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
         --license ,-l : show license terms for this build

Running 50 iterations of tests over 128 MB of GPU memory on card 0:
Tesla C2050

Running memory bandwidth test over 20 iterations of 64 MB transfers...
         Estimated bandwidth 69189.19 MB/s

Test iteration 1 (GPU 0, 128 MiB): 0 errors so far
         Moving Inversions (ones and zeros): 0 errors (25 ms)
         Memtest86 Walking 8-bit: 0 errors (172 ms)
         True Walking zeros (8-bit): 0 errors (87 ms)
         True Walking ones (8-bit): 0 errors (84 ms)
         Moving Inversions (random): 0 errors (25 ms)
         Memtest86 Walking zeros (32-bit): 0 errors (364 ms)
         Memtest86 Walking ones (32-bit): 0 errors (369 ms)
         Random blocks: 0 errors (57 ms)
         Memtest86 Modulo-20: 0 errors (784 ms)
         Logic (one iteration): 0 errors (22 ms)
         Logic (4 iterations): 0 errors (46 ms)
         Logic (shared memory, one iteration): 0 errors (18 ms)
         Logic (shared-memory, 4 iterations): 0 errors (65 ms)

Test iteration 2 (GPU 0, 128 MiB): 0 errors so far
.....etc...

It would be interesting to see if this tool "finds" anything on the
GTX4xx cards which are displaying the issues with pmemd.cuda discussed
in this thread and others. This may offer a route to localise the
problem and facilitate faster development of a solution.
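As a starting point on an affected card, something along these lines, per the usage banner above, would exercise most of a GTX 470's memory (the size and iteration count are only my guess at a reasonable overnight run):

        ./memtestG80 --gpu 0 1024 10000    # test 1024 MB on GPU 0 for 10000 iterations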

regards,

Mark

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber