Many thanks, Aron, Hector, and Scott,
I just wanted to follow up on your advice, and add some further experience for the record.
--------
> From: Scott Le Grand <varelse2005.gmail.com>
> Date: Wednesday, 13 March 2013 4:39 AM
> Run JAC NVE for 100,000 iterations. If it crashes, you have a bad GPU.
Scott, the single-GPU portion of the JAC NVE benchmark script was precisely what I had run to make sure it wasn't just my setup leading to the failure, so on this evidence the card is, in fact, cactus.
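For the archives, this is roughly how I invoked it; the directory and input filenames below are my best recollection of the GPU benchmark suite layout, so treat them as assumptions rather than gospel:

  # Single-GPU JAC NVE run; nstlim in mdin set for 100,000 steps.
  # Directory and file names from memory -- adjust to your copy of the suite.
  cd Amber_GPU_Benchmark_Suite/JAC_production_NVE
  $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd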
--------
>From: "Hector A. Baldoni" <hbaldoni.unsl.edu.ar>
>Date: Wednesday, 13 March 2013 6:20 AM
> Before deciding whether you have a bad GPU, you could try installing the latest
> gtx460 drivers and the CUDA 5.0 toolkit, and recompiling pmemd with gnu, all
> patches included.
Hector, this was actually my next step the night before: I installed a newer nVidia driver (the 310 series) and went back to the CUDA 5.0 toolkit (I had started there, dropped back to 4.2 after reading some advice to that effect, but *then* found that a recent patch in fact updated AMBER to CUDA 5). All patches were already applied, and I re-installed the lot (./configure -cuda gnu, make install, make test) -- no improvement. Again, it seems the GPU is cactus.
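For completeness, the rebuild amounted to the following; /usr/local/cuda-5.0 is just a guess at a typical toolkit location, so point CUDA_HOME wherever yours actually lives:

  export CUDA_HOME=/usr/local/cuda-5.0   # location of the CUDA 5.0 toolkit
  cd $AMBERHOME
  ./configure -cuda gnu                  # GPU build with the GNU compilers
  make install
  make test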
At this stage, after a little digging, I found that I could control the GPU fan manually by first setting the "Coolbits" option to 5 in the GPU Device section of xorg.conf, then using the slider bar under the [Debian] Administration -> nVidia settings. Under "auto", the fan idled at 40% and held the card at 33C with no load. When the temperature started hitting about 67C, the fan slowly ticked up to 44-45%, until the job aborted at around 71C. If I manually increased the fan speed to 80% before starting the job, the temperature seemed to hold at a steady state of about 63C, and the job ran for 45 minutes instead of ~90 seconds. It was *supposed* to run for ~3 hours, but I stopped babysitting it after 10 minutes, so I don't know at what temperature it failed.
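In case it saves someone else the digging, the xorg.conf stanza looks something like this (the Identifier string should match whatever your existing Device section already uses):

  Section "Device"
      Identifier "Device0"        # match your existing Device identifier
      Driver     "nvidia"
      Option     "Coolbits" "5"   # exposes manual fan-speed control
  EndSection

After restarting X, the fan slider shows up in the nVidia settings panel. I gather the corresponding nvidia-settings command-line attributes vary between driver versions, so I stuck with the slider.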
--------
> From: Aron Broom <broomsday.gmail.com>
> Date: Wednesday, 13 March 2013 7:39 AM
> Also, if you check out the SimTK or OpenMM people's website, I believe they
> have a GPU version of the popular memtest86 application, that can allow you
> to quickly or exhaustively check your GPUs memory.
> I had found that I was often having AMBER jobs complete, but with all
> positions as NaN after a few fs on a GTX 580, but not on a 570 or M2070,
> and running that application showed the 580 had a number of bad memory
> sectors.
Aron, thanks so much for the lead on the stress test; this is exactly what I was looking for but couldn't find to save myself. I downloaded and compiled the memtestG80 code, and ran 1000 iterations over 256 MB of memory (33%) and then over 912 MB (~97%, I think). I left the fan at 40%, and the temperature sat around 72C. The non-zero records are copied below, but the upshot is that several errors popped up, only under the "Memtest86 Modulo-20" and "Random blocks" tests. I don't know enough to interpret these in the context of the aborted pmemd.cuda jobs, but I can appreciate that the card has failed.
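For the record, the invocations were simply the following (if I'm remembering the usage line correctly, the positional arguments are the MiB of GPU RAM to test and the number of test iterations):

  ./memtestG80 256 1000    # ~33% of the card's memory
  ./memtestG80 912 1000    # ~97%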
--------
Seems I need to ./configure my budget and get the best replacement I can! Many thanks once again.
Kind Regards,
John
Re the memtestG80: the full record for most iterations looks like this:
Test iteration 1 (GPU 0, 256 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (17 ms)
Memtest86 Walking 8-bit: 0 errors (136 ms)
True Walking zeros (8-bit): 0 errors (69 ms)
True Walking ones (8-bit): 0 errors (68 ms)
Moving Inversions (random): 0 errors (17 ms)
Memtest86 Walking zeros (32-bit): 0 errors (270 ms)
Memtest86 Walking ones (32-bit): 0 errors (248 ms)
Random blocks: 0 errors (109 ms)
Memtest86 Modulo-20: 0 errors (721 ms)
Logic (one iteration): 0 errors (15 ms)
Logic (4 iterations): 0 errors (42 ms)
Logic (shared memory, one iteration): 0 errors (24 ms)
Logic (shared-memory, 4 iterations): 0 errors (75 ms)
Here are the non-zero records from my output for the two runs (1000 iterations each, the first over 256 MB, the second over 912 MB):
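(A quick way to pull these out of the full log is something like:

  grep -v ': 0 errors' memtestG80.log

though I kept a few zero-error iteration lines below for context.)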
------
Running 1000 iterations of tests over 256 MB of GPU memory on card 0: GeForce GTX 460
Running memory bandwidth test over 20 iterations of 128 MB transfers...
Estimated bandwidth 104489.80 MB/s
Test iteration 1 (GPU 0, 256 MiB): 0 errors so far
Test iteration 2 (GPU 0, 256 MiB): 0 errors so far
Test iteration 286 (GPU 0, 256 MiB): 0 errors so far
Memtest86 Modulo-20: 31 errors (710 ms)
Test iteration 287 (GPU 0, 256 MiB): 31 errors so far
Test iteration 479 (GPU 0, 256 MiB): 31 errors so far
Memtest86 Modulo-20: 2 errors (718 ms)
Test iteration 480 (GPU 0, 256 MiB): 33 errors so far
Test iteration 609 (GPU 0, 256 MiB): 33 errors so far
Memtest86 Modulo-20: 144 errors (713 ms)
Test iteration 610 (GPU 0, 256 MiB): 177 errors so far
Test iteration 937 (GPU 0, 256 MiB): 177 errors so far
Memtest86 Modulo-20: 13 errors (721 ms)
Test iteration 938 (GPU 0, 256 MiB): 190 errors so far
Test iteration 985 (GPU 0, 256 MiB): 190 errors so far
Random blocks: 3 errors (108 ms)
Test iteration 986 (GPU 0, 256 MiB): 193 errors so far
-------
Running 1000 iterations of tests over 912 MB of GPU memory on card 0: GeForce GTX 460
Running memory bandwidth test over 20 iterations of 456 MB transfers...
Estimated bandwidth 103050.85 MB/s
Test iteration 1 (GPU 0, 912 MiB): 0 errors so far
Test iteration 2 (GPU 0, 912 MiB): 0 errors so far
Test iteration 43 (GPU 0, 912 MiB): 0 errors so far
Memtest86 Modulo-20: 1 errors (2412 ms)
Test iteration 44 (GPU 0, 912 MiB): 1 errors so far
Test iteration 95 (GPU 0, 912 MiB): 1 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 96 (GPU 0, 912 MiB): 2 errors so far
Test iteration 134 (GPU 0, 912 MiB): 2 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 135 (GPU 0, 912 MiB): 3 errors so far
Test iteration 146 (GPU 0, 912 MiB): 3 errors so far
Memtest86 Modulo-20: 3 errors (2414 ms)
Test iteration 147 (GPU 0, 912 MiB): 6 errors so far
Test iteration 148 (GPU 0, 912 MiB): 6 errors so far
Memtest86 Modulo-20: 6 errors (2412 ms)
Test iteration 149 (GPU 0, 912 MiB): 12 errors so far
Test iteration 394 (GPU 0, 912 MiB): 12 errors so far
Random blocks: 2 errors (383 ms)
Test iteration 395 (GPU 0, 912 MiB): 14 errors so far
Test iteration 579 (GPU 0, 912 MiB): 14 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 580 (GPU 0, 912 MiB): 15 errors so far
Test iteration 610 (GPU 0, 912 MiB): 15 errors so far
Memtest86 Modulo-20: 3 errors (2413 ms)
Test iteration 611 (GPU 0, 912 MiB): 18 errors so far
Test iteration 624 (GPU 0, 912 MiB): 18 errors so far
Random blocks: 1 errors (384 ms)
Test iteration 625 (GPU 0, 912 MiB): 19 errors so far
Test iteration 655 (GPU 0, 912 MiB): 19 errors so far
Random blocks: 1 errors (384 ms)
Test iteration 656 (GPU 0, 912 MiB): 20 errors so far
Test iteration 792 (GPU 0, 912 MiB): 20 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 793 (GPU 0, 912 MiB): 21 errors so far
Test iteration 826 (GPU 0, 912 MiB): 21 errors so far
Memtest86 Modulo-20: 6 errors (2411 ms)
Test iteration 827 (GPU 0, 912 MiB): 27 errors so far
Test iteration 856 (GPU 0, 912 MiB): 27 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 857 (GPU 0, 912 MiB): 28 errors so far
Test iteration 948 (GPU 0, 912 MiB): 28 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 949 (GPU 0, 912 MiB): 29 errors so far
Test iteration 955 (GPU 0, 912 MiB): 29 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 956 (GPU 0, 912 MiB): 30 errors so far
Test iteration 985 (GPU 0, 912 MiB): 30 errors so far
Random blocks: 1 errors (383 ms)
Test iteration 986 (GPU 0, 912 MiB): 31 errors so far
==== === == = = = = = = = = = =
John Gehman Office +61 3 8344 2417
ARC Future Fellow Fax +61 3 9347 8189
School of Chemistry Magnets +61 3 8344 2470
Bio21 Institute Mobile +61 407 536 585
30 Flemington Rd jgehman.unimelb.edu.au
Univ. of Melbourne .GehmanLab
VIC 3010 Australia
http://www2.chemistry.unimelb.edu.au/staff/jgehman/research/
"Science really suffers from bureaucracy. If we hadn't broken
every single WHO rule many times over, we would never
have defeated smallpox. Never."
-- Isao Arita, final director of the WHO smallpox eradication program
==== === == = = = = = = = = = =