On 09/09/2010 17:20, Sergio R Aragon wrote:
> Hello All,
>
> I've run the jac test using Ross Walker's input files 3 times on my GTX 470
> card with my original Amber 11 install with bugfixes 1-6. Here is a summary
> of results:
>
> Error: the launch timed out and was terminated launching kernel kPMEGetGridWeights
>
> 1. Failed at 120 ps; no X server running. No other jobs running on host.
> 2. Failed at 370 ps; X server running, nvidia-settings running. GPU temp 83 C
> 3. Failed at 178 ps; no X server running. No other jobs running on host.
>
> Here is the description of my system again for easy reference:
> MSI GTX 470
> Amber 11 Vanilla copy with bugfixes 1 to 6 applied.
> RHEL 4.8 x86_64, gfortran 4.1.2-44, nvcc 3.1 v0.2.1221, NVIDIA Driver
> v256.35
> Compiled Amber 11 with ./configure -cuda gnu
>
> My temperature measurements show "moderate" operating temperatures for the card, a few degrees lower than other larger jobs I've recently run. Does anybody really think this is a temperature issue? Nevertheless, it appears that the nvidia-settings tool does provide a way of down-clocking the card. BTW, the 470 card already runs at lower clocks than the C2050.
>
> I am going to restart from scratch now, as suggested by Ross: recompile a fresh Amber 11 suite with bugfix.all (patches 1-8) applied, update the NVIDIA driver to 256.53, and re-run the jac tests.
>
> I note that Christophe Deprez has just reported similar non-reproducible numbers and an identical error with his 470 card on the jac run. His card is identical to mine, an MSI GTX 470. It seems we are able to reproduce this error consistently on 400-series cards.
>
> Cheers, Sergio
Hi all,
Just a quick thought whilst reading this thread. In the past on standard
PCs, when encountering random crashes with code that was known to
function fine on other machines, I would always attempt to rule out
certain aspects of the hardware. I'd usually start with the memory,
using a software memory tester such as memtest86+ and running it overnight.
Imran S. Haque, from (I think) Vijay Pande's group, has written a useful
analogous tool for GPUs, MemtestG80. It can be found at:
http://www.cs.stanford.edu/people/ihaque/papers/gpuser.pdf
https://simtk.org/project/xml/downloads.xml?group_id=385
( b61149bae88bb5398877b8c00d428bfc memtestG80-1.1-src.tar.gz )
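As a quick sanity check of the download, the checksum can be compared
against the value above; a minimal example, assuming GNU coreutils'
md5sum is available:

$ md5sum memtestG80-1.1-src.tar.gz
b61149bae88bb5398877b8c00d428bfc  memtestG80-1.1-src.tar.gz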
I was eventually able to compile this on a RHEL 4.8 x86_64 box; however,
I needed to modify the "Makefile.linux64" makefile, setting the
following variable as follows:
POPT_DIR=/usr/lib64/
to avoid an "undefined reference to `__stack_chk_fail'" linking error
(presumably the popt build it links against by default was compiled
with stack-protector support, which the older glibc on RHEL 4 lacks;
pointing POPT_DIR at the system popt sidesteps this).
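For reference, the edit can also be scripted; this is just a sketch,
assuming the assignment sits at the start of a line in Makefile.linux64
(check the file first, as the exact form may differ between releases):

sed -i 's|^POPT_DIR *=.*|POPT_DIR=/usr/lib64/|' Makefile.linux64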
Then, the tool can finally be compiled with
make -f Makefile.linux64
and then run as follows:
[07:57][bunny:1.11][mjw:memtestG80-1.1]$ ./memtestG80
-------------------------------------------------------------
| MemtestG80 v1.00 |
| |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
| |
| Defaults: GPU 0, 128MB RAM, 50 test iterations |
| Amount of tested RAM will be rounded up to nearest 2MB |
-------------------------------------------------------------
Available flags:
--gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
--license ,-l : show license terms for this build
Running 50 iterations of tests over 128 MB of GPU memory on card 0:
Tesla C2050
Running memory bandwidth test over 20 iterations of 64 MB transfers...
Estimated bandwidth 69189.19 MB/s
Test iteration 1 (GPU 0, 128 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (25 ms)
Memtest86 Walking 8-bit: 0 errors (172 ms)
True Walking zeros (8-bit): 0 errors (87 ms)
True Walking ones (8-bit): 0 errors (84 ms)
Moving Inversions (random): 0 errors (25 ms)
Memtest86 Walking zeros (32-bit): 0 errors (364 ms)
Memtest86 Walking ones (32-bit): 0 errors (369 ms)
Random blocks: 0 errors (57 ms)
Memtest86 Modulo-20: 0 errors (784 ms)
Logic (one iteration): 0 errors (22 ms)
Logic (4 iterations): 0 errors (46 ms)
Logic (shared memory, one iteration): 0 errors (18 ms)
Logic (shared-memory, 4 iterations): 0 errors (65 ms)
Test iteration 2 (GPU 0, 128 MiB): 0 errors so far
.....etc...
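For an overnight soak comparable to the memtest86+ runs mentioned above,
the amount of tested memory and the iteration count can be raised via
the positional arguments shown in the usage banner. A sketch (the
1024 MB / 10000 iteration figures are just suggestions; a GTX 470
carries 1.25 GB, so leave some headroom for the display and the CUDA
context):

./memtestG80 --gpu 0 1024 10000 > memtestG80.log 2>&1 &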
It would be interesting to see if this tool "finds" anything on the
GTX4xx cards which are displaying the issues with pmemd.cuda discussed
in this thread and others. This may offer a route to localise the
problem and facilitate faster development of a solution.
regards,
Mark