Re: [AMBER] debug run from Sergio R Aragon on 2010-09-10 (Amber Archive Sep 2010)

From: Sergio R Aragon <aragons.sfsu.edu>
Date: Fri, 10 Sep 2010 22:36:35 +0000

Hi Scott,

It may still have been running when I killed it. I'm presently running 15000 cycles of memtestG80 on 1 GB of memory on my GTX 470. This will finish after the weekend. When the memtest finishes, I will run the debugger again without recompiling and just let it go. I think the memory test results could be useful, at least to rule out the possibility of the card having "hard" memory errors. Thanks,

Sergio

Memtest: on 1747 iterations, no errors so far.
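
With thousands of iterations, the memtestG80 log gets long. Below is a minimal Python sketch for tallying errors per test from a saved log. It assumes the log lines look like the output Mark posted (quoted further down in this message); the function name and the sample text are only illustrative, and the real output format may differ between memtestG80 versions.

```python
import re
from collections import Counter

# Patterns modelled on the memtestG80 output quoted in this thread (an assumption).
ITERATION_RE = re.compile(
    r"^Test iteration (\d+) \(GPU \d+, \d+ MiB\): (\d+) errors? so far")
TEST_RE = re.compile(r"^(.+?):\s+(\d+) errors? \((\d+) ms\)$")


def summarize(log_text):
    """Return (highest iteration seen, Counter of errors per test name)."""
    iterations = 0
    errors = Counter()
    for raw in log_text.splitlines():
        # Drop mail quote markers and indentation so quoted logs parse too.
        line = raw.lstrip("> ").rstrip()
        match = ITERATION_RE.match(line)
        if match:
            iterations = max(iterations, int(match.group(1)))
            continue
        match = TEST_RE.match(line)
        if match:
            errors[match.group(1)] += int(match.group(2))
    return iterations, errors


# A few lines copied from the output quoted in this thread.
SAMPLE = """\
Test iteration 1 (GPU 0, 128 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (25 ms)
Random blocks: 0 errors (57 ms)
Memtest86 Modulo-20: 0 errors (784 ms)
"""

if __name__ == "__main__":
    count, per_test = summarize(SAMPLE)
    print("iterations:", count, "total errors:", sum(per_test.values()))
    for name, n in per_test.items():
        print(f"  {name}: {n}")
```

To use it on a real run, redirect the memtestG80 output to a file and pass the file contents to summarize(). Any test with a nonzero count is the place to look first.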

-----Original Message-----
From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
Sent: Friday, September 10, 2010 1:35 PM
To: AMBER Mailing List
Subject: Re: [AMBER] debug run

Does it still run the test?

If so, then just let it go...

-----Original Message-----
From: Jason Swails [mailto:jason.swails.gmail.com]
Sent: Friday, September 10, 2010 12:26
To: AMBER Mailing List
Subject: Re: [AMBER] debug run

You probably need to recompile with -O0 and -g passed to the compilers to
remove optimizations and add debug symbols (maybe the latter implies the
former... not sure).
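
On the open question above: with GCC-family compilers, -g and -O0 are independent. -g only adds debug symbols and does not change the optimization level, so both flags are needed for an unoptimized debug build. (nvcc has a separate -G flag for device-side debug info.) Before starting another cuda-gdb session, a rough check for host-side symbols can be sketched in Python as below. This assumes binutils' readelf is installed and that the build emits DWARF sections; the helper names are made up for illustration, and the check says nothing about device code.

```python
import subprocess
import sys


def has_debug_info(section_listing):
    """True if a `readelf -S` section listing mentions a .debug_info section."""
    return ".debug_info" in section_listing


def binary_has_debug_info(path):
    """Run `readelf -S` on the binary and look for DWARF debug sections."""
    listing = subprocess.run(
        ["readelf", "-S", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return has_debug_info(listing)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_debug.py /path/to/pmemd.cuda
    found = binary_has_debug_info(sys.argv[1])
    print("debug symbols found" if found else "no debug symbols found")
```

If this reports no debug symbols, gdb will most likely print "(no debugging symbols found)" again and a rebuild is needed.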

On Fri, Sep 10, 2010 at 3:03 PM, Sergio R Aragon <aragons.sfsu.edu> wrote:

> Hi Scott,
>
> When I try your instructions on my machine, I get the output:
> (no debugging symbols found)
> Don't we need to recompile pmemd.cuda with a debug option or something
> like that before running this test?
>
> Sergio
>
>
> -----Original Message-----
> From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
> Sent: Friday, September 10, 2010 11:33 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GPU memory test program
>
> Here's one thing you guys can all do:
>
>
> Take JAC and:
>
> cuda-gdb pmemd.cuda
> run -O -i mdin -c inpcrd
>
> (assuming mdin is your input file and inpcrd is the coordinate input file)
>
> I'm doing this now and waiting for it to trip up...
>
>
> -----Original Message-----
> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> Sent: Friday, September 10, 2010 10:18
> To: AMBER Mailing List
> Subject: [AMBER] GPU memory test program
>
> Hello Mark,
>
> Thanks for sharing the information on that useful GPU memory checking tool.
> This will be useful in debugging hardware conditions when isolated errors
> occur. The case we have here, with 10 different GTX 480 and 4 GTX 470 cards
> is pointing to something serious in either hardware or driver in these
> cards. If it's hardware it could be a memory issue, so Scott Le Grand may
> find this tool useful in his further work on the problem. Being at Nvidia,
> he probably has access to a whole series of tools that we don't know
> about... but I'm going to try compiling it with your instructions.
>
> Thanks again.
>
> Sergio
>
> -----Original Message-----
> From: Mark Williamson [mailto:mjw.mjw.name]
> Sent: Friday, September 10, 2010 8:15 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] JAC error reproduced 3 times on GTX 470
>
> On 09/09/2010 17:20, Sergio R Aragon wrote:
> > Hello All,
> >
> > I've run the jac test using Ross Walker's input files 3 times on my GTX 470
> > card with my original Amber 11 install with Bug fixes 1-6. Here is a summary
> > of results:
> >
> > Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights
> >
> > 1. Failed at 120 ps; no X server running. No other jobs running on host.
> > 2. Failed at 370 ps; X server running, nvidia-settings running. GPU temp 83 C
> > 3. Failed at 178 ps; no X server running. No other jobs running on host.
> >
> > Here is the description of my system again for easy reference:
> > MSI GTX 470
> > Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
> > Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver v256.35
> > Compiled Amber 11 with ./configure -cuda gnu
> >
> > My temperature measurements show "moderate" operating temperatures for
> > the card, a few degrees lower than other larger jobs that I've recently run.
> > Does somebody really think this is a temperature issue? Nevertheless, it
> > appears that the nvidia-settings tool does provide a way of down-clocking
> > the card. BTW, the 470 card already runs at lower clocks than the C2050.
> >
> > I am going to restart from scratch now, as suggested by Ross, by
> > recompiling a new Amber 11 suite with bugfix.all (patches 1-8), updating the
> > Nvidia driver to 256.53, and rerunning the jac tests.
> >
> > I note that Christophe Deprez has just reported similar non-reproducible
> > numbers and an identical error with his 470 card on the jac run. His card is
> > identical to mine, MSI GTX 470. It seems we are able to consistently
> > reproduce this error on 400 series cards.
> >
> > Cheers, Sergio
>
> Hi all,
>
> Just a quick thought whilst reading this thread. In the past on standard
> PCs, when encountering random crashes with code that was known to
> function fine on other machines, I would always attempt to rule out
> certain aspects of the hardware. I'd usually start with the memory by
> using a software memory tester such as memtest86+, and running it
> overnight.
>
> Imran S Haque, from I think Vijay Pande's group, has written a useful,
> analogous tool for GPUs. This can be found at:
>
> http://www.cs.stanford.edu/people/ihaque/papers/gpuser.pdf
> https://simtk.org/project/xml/downloads.xml?group_id=385
> ( b61149bae88bb5398877b8c00d428bfc memtestG80-1.1-src.tar.gz )
>
> I was eventually able to compile this on a RHEL 4.8 x86_64 box; however
> I needed to modify the "Makefile.linux64" makefile, changing the
> following variable as follows:
>
> POPT_DIR=/usr/lib64/
>
> to avoid an "undefined reference to `__stack_chk_fail'" linking error.
>
> Then, the tool can finally be compiled with
> make -f Makefile.linux64
>
> and then run as follows:
>
> [07:57][bunny:1.11][mjw:memtestG80-1.1]$ ./memtestG80
> -------------------------------------------------------------
> | MemtestG80 v1.00 |
> | |
> | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
> | |
> | Defaults: GPU 0, 128MB RAM, 50 test iterations |
> | Amount of tested RAM will be rounded up to nearest 2MB |
> -------------------------------------------------------------
>
> Available flags:
> --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
> --license ,-l : show license terms for this build
>
> Running 50 iterations of tests over 128 MB of GPU memory on card 0:
> Tesla C2050
>
> Running memory bandwidth test over 20 iterations of 64 MB transfers...
> Estimated bandwidth 69189.19 MB/s
>
> Test iteration 1 (GPU 0, 128 MiB): 0 errors so far
> Moving Inversions (ones and zeros): 0 errors (25 ms)
> Memtest86 Walking 8-bit: 0 errors (172 ms)
> True Walking zeros (8-bit): 0 errors (87 ms)
> True Walking ones (8-bit): 0 errors (84 ms)
> Moving Inversions (random): 0 errors (25 ms)
> Memtest86 Walking zeros (32-bit): 0 errors (364 ms)
> Memtest86 Walking ones (32-bit): 0 errors (369 ms)
> Random blocks: 0 errors (57 ms)
> Memtest86 Modulo-20: 0 errors (784 ms)
> Logic (one iteration): 0 errors (22 ms)
> Logic (4 iterations): 0 errors (46 ms)
> Logic (shared memory, one iteration): 0 errors (18 ms)
> Logic (shared-memory, 4 iterations): 0 errors (65 ms)
>
> Test iteration 2 (GPU 0, 128 MiB): 0 errors so far
> .....etc...
>
> It would be interesting to see if this tool "finds" anything on the
> GTX4xx cards which are displaying the issues with pmemd.cuda discussed
> in this thread and others. This may offer a route to localise the
> problem and facilitate faster development of a solution.
>
> regards,
>
> Mark
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032

Received on Fri Sep 10 2010 - 16:00:02 PDT