Re: [AMBER] Memory test con GTX 470

From: Scott Le Grand <SLeGrand.nvidia.com>
Date: Sun, 12 Sep 2010 20:15:11 -0700

set cuda kernel_events 0

in cuda-gdb


-----Original Message-----
From: Sergio R Aragon [mailto:aragons.sfsu.edu]
Sent: Sunday, September 12, 2010 08:01
To: AMBER Mailing List
Subject: Re: [AMBER] Memory test con GTX 470

Hello Scott & other interested parties,

Just reporting that the memory test ran 15,000 iterations over the course of two days on my GTX 470 with ZERO errors. Since the kernel launch timeout errors occur every few hours, it seems safe to say that the memory is not the source of the error.

Regarding the debug run, is there a way to run that test without getting lines like these:
....
[Launch of CUDA Kernel 13463 on Device 0]
[Termination of CUDA Kernel 13463 on Device 0]
[Launch of CUDA Kernel 13464 on Device 0]
[Termination of CUDA Kernel 13464 on Device 0]
[Launch of CUDA Kernel 13465 on Device 0]
[Termination of CUDA Kernel 13465 on Device 0]
---Type <return> to continue, or q <return> to quit---

I can't sit there and type return for hours and hours. Thanks,

Sergio

-----Original Message-----
From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
Sent: Friday, September 10, 2010 1:35 PM
To: AMBER Mailing List
Subject: Re: [AMBER] debug run

Does it still run the test?

If so, then just let it go...



-----Original Message-----
From: Jason Swails [mailto:jason.swails.gmail.com]
Sent: Friday, September 10, 2010 12:26
To: AMBER Mailing List
Subject: Re: [AMBER] debug run

You probably need to recompile with -O0 and -g passed to the compilers to
remove optimizations and add debug symbols (maybe the latter implies the
former... not sure).
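
As an illustration only (this is not the actual AMBER build machinery, and the
file names are placeholders), the idea is roughly:

# host code: keep debug symbols, drop optimization
gfortran -g -O0 -c somefile.F90
gcc -g -O0 -c somefile.c
# device code: -g for host-side debug info, -G for device-side debug info
# (which also disables most device optimizations), as cuda-gdb expects
nvcc -g -G -c kernels.cu

In practice these flags would have to end up in whatever flag variables the
AMBER configure step writes out for pmemd.cuda.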

On Fri, Sep 10, 2010 at 3:03 PM, Sergio R Aragon <aragons.sfsu.edu> wrote:

> Hi Scott,
>
> When I attempt your instructions on my machine, I get the output:
> (no debugging symbols found)
> Don't we need to recompile pmemd.cuda with a debug option or something
> like that before running this test?
>
> Sergio
>
>
> -----Original Message-----
> From: Scott Le Grand [mailto:SLeGrand.nvidia.com]
> Sent: Friday, September 10, 2010 11:33 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] GPU memory test program
>
> Here's one thing you guys can all do:
>
>
> Take JAC and:
>
> cuda-gdb pmemd.cuda
> run -O -i mdin -c inpcrd
>
> (assuming mdin is your input file and inpcrd is the coordinate input file)
>
> I'm doing this now and waiting for it to trip up...
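>
> When it does trip up, a first look could be something like the sketch below
> (backtrace is plain gdb; "info cuda threads" only if this cuda-gdb version
> supports it):
>
> (cuda-gdb) backtrace
> (cuda-gdb) info cuda threads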
>
>
> -----Original Message-----
> From: Sergio R Aragon [mailto:aragons.sfsu.edu]
> Sent: Friday, September 10, 2010 10:18
> To: AMBER Mailing List
> Subject: [AMBER] GPU memory test program
>
> Hello Mark,
>
> Thanks for sharing the information on that useful GPU memory checking tool.
> It will be useful for ruling out hardware problems when isolated errors
> occur. The case we have here, with 10 different GTX 480 and 4 GTX 470 cards,
> points to something serious in either the hardware or the driver for these
> cards. If it's hardware, it could be a memory issue, so Scott Le Grand may
> find this tool useful in his further work on the problem. Being at Nvidia,
> he probably has access to a whole series of tools that we don't know
> about... but I'm going to try compiling it with your instructions.
>
> Thanks again.
>
> Sergio
>
> -----Original Message-----
> From: Mark Williamson [mailto:mjw.mjw.name]
> Sent: Friday, September 10, 2010 8:15 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] JAC error reproduced 3 times on GTX 470
>
> On 09/09/2010 17:20, Sergio R Aragon wrote:
> > Hello All,
> >
> > I've run the jac test using Ross Walker's input files 3 times on my GTX 470
> > card with my original Amber 11 install with bugfixes 1-6. Here is a summary
> > of results:
> >
> > Error: the launch timed out and was terminated launching kernel
> > kPMEGetGridWeights
> >
> > 1. Failed at 120 ps; no X server running. No other jobs running on host.
> > 2. Failed at 370 ps; X server running, nvidia-settings running. GPU temp
> >    83 C.
> > 3. Failed at 178 ps; no X server running. No other jobs running on host.
> >
> > Here is the description of my system again for easy reference:
> > MSI GTX 470
> > Amber 11 vanilla copy with bugfixes 1 to 6 applied.
> > Redhat 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> > v256.35
> > Compiled Amber 11 with ./configure -cuda gnu
> >
> > My temperature measurements show "moderate" operating temperatures for the
> > card, a few degrees lower than for other, larger jobs that I've recently
> > run. Does somebody really think this is a temperature issue? Nevertheless,
> > it appears that the nvidia-settings tool does provide a way of
> > down-clocking the card. BTW, the 470 card already runs at lower clocks than
> > the C2050.
> >
> > I am going to restart from scratch now, as suggested by Ross: recompile a
> > new Amber 11 suite with bugfix.all (patches 1-8), update the NVIDIA driver
> > to 256.53, and rerun the jac tests.
> >
> > I note that Christophe Deprez has just reported similarly non-reproducible
> > numbers and an identical error with his 470 card on the jac run. His card
> > is identical to mine, an MSI GTX 470. It seems we are able to consistently
> > reproduce this error on 400-series cards.
> >
> > Cheers, Sergio
>
> Hi all,
>
> Just a quick thought whilst reading this thread. In the past on standard
> PCs, when encountering random crashes with code that was known to
> function fine on other machines, I would always attempt to rule out
> certain aspects of the hardware. I'd usually start with the memory by
> using a software memory tester such as memtest86+, and running it
> overnight.
>
> Imran S Haque, from I think Vijay Pande's group, has written a useful,
> analogous tool for GPUs. This can be found at:
>
> http://www.cs.stanford.edu/people/ihaque/papers/gpuser.pdf
> https://simtk.org/project/xml/downloads.xml?group_id=385
> ( b61149bae88bb5398877b8c00d428bfc memtestG80-1.1-src.tar.gz )
>
> I was eventually able to compile this on a RHEL 4.8 x86_64 box; however,
> I needed to modify the "Makefile.linux64" makefile, changing the
> following variable to:
>
> POPT_DIR=/usr/lib64/
>
> to avoid an "undefined reference to `__stack_chk_fail'" linking error.
>
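> (Equivalently, assuming GNU sed and that the variable sits on a line of its
> own, the same edit can be scripted as
>
> sed -i 's|^POPT_DIR=.*|POPT_DIR=/usr/lib64/|' Makefile.linux64
>
> though editing the file by hand is just as easy.)
>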
> Then, the tool can finally be compiled with
> make -f Makefile.linux64
>
> and then run as follows:
>
> [07:57][bunny:1.11][mjw:memtestG80-1.1]$ ./memtestG80
> -------------------------------------------------------------
> |                     MemtestG80 v1.00                       |
> |                                                            |
> | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]   |
> |                                                            |
> | Defaults: GPU 0, 128MB RAM, 50 test iterations             |
> | Amount of tested RAM will be rounded up to nearest 2MB     |
> -------------------------------------------------------------
>
> Available flags:
> --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
> --license ,-l : show license terms for this build
>
> Running 50 iterations of tests over 128 MB of GPU memory on card 0:
> Tesla C2050
>
> Running memory bandwidth test over 20 iterations of 64 MB transfers...
> Estimated bandwidth 69189.19 MB/s
>
> Test iteration 1 (GPU 0, 128 MiB): 0 errors so far
> Moving Inversions (ones and zeros): 0 errors (25 ms)
> Memtest86 Walking 8-bit: 0 errors (172 ms)
> True Walking zeros (8-bit): 0 errors (87 ms)
> True Walking ones (8-bit): 0 errors (84 ms)
> Moving Inversions (random): 0 errors (25 ms)
> Memtest86 Walking zeros (32-bit): 0 errors (364 ms)
> Memtest86 Walking ones (32-bit): 0 errors (369 ms)
> Random blocks: 0 errors (57 ms)
> Memtest86 Modulo-20: 0 errors (784 ms)
> Logic (one iteration): 0 errors (22 ms)
> Logic (4 iterations): 0 errors (46 ms)
> Logic (shared memory, one iteration): 0 errors (18 ms)
> Logic (shared-memory, 4 iterations): 0 errors (65 ms)
>
> Test iteration 2 (GPU 0, 128 MiB): 0 errors so far
> .....etc...
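>
> Based on the usage text above, a longer run pinned to a particular card would
> look something like the following (the memory size and iteration count are
> only illustrative):
>
> ./memtestG80 --gpu 0 1024 15000
>
> i.e. test 1024 MB of GPU 0's memory for 15000 iterations.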
>
> It would be interesting to see if this tool "finds" anything on the
> GTX4xx cards which are displaying the issues with pmemd.cuda discussed
> in this thread and others. This may offer a route to localise the
> problem and facilitate faster development of a solution.
>
> regards,
>
> Mark
>



-- 
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Sep 12 2010 - 20:30:03 PDT