Yeah, I saw a lot of errors with the 580 card that was giving trouble, and
0 errors on the others. Tragic; best of luck replacing it!
On Wed, Mar 13, 2013 at 9:40 AM, John Gehman <jgehman.unimelb.edu.au> wrote:
> Many thanks Aron, Hector, and Scott,
>
> I just wanted to follow up on your advice, and add some further experience
> for the record.
>
> --------
> > From: Scott Le Grand <varelse2005.gmail.com>
> > Date: Wednesday, 13 March 2013 4:39 AM
>
> > Run JAC NVE for 100,000 iterations. If it crashes, you have a bad GPU.
>
> Scott, the single-GPU portion of the JAC NVE benchmark script was
> precisely what I had run to make sure it wasn't just my setup leading to
> failure, so based on this evidence the card is, in fact, cactus.
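>
> (For anyone following along at home, the run was essentially the stock
> single-GPU JAC NVE case from the Amber GPU benchmark suite; the exact
> directory and file names depend on the version of the suite you download,
> so treat this as a sketch rather than the literal script:
>
>    cd Amber_GPU_Benchmark_Suite/PME/JAC_production_NVE
>    $AMBERHOME/bin/pmemd.cuda -O -i mdin -p prmtop -c inpcrd \
>        -o mdout.jac_nve -r restrt.jac_nve -x mdcrd.jac_nve
>
> with nstlim bumped to 100,000 in the mdin, per Scott's suggestion.)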
>
> --------
> > From: "Hector A. Baldoni" <hbaldoni.unsl.edu.ar>
> > Date: Wednesday, 13 March 2013 6:20 AM
>
> > Before deciding that you have a bad GPU, you could try installing the
> > latest GTX 460 drivers and the CUDA 5.0 toolkit, and recompiling pmemd
> > with the GNU compilers with all patches included.
>
> Hector, this was actually my next step the night before: I found a newer
> nVidia driver (3.10) to install, and went back to the CUDA 5.0 toolkit (I
> had started there, but dropped back to 4.2 after reading some advice to
> that effect, and *then* found that a recent patch had in fact updated pmemd
> to CUDA 5). All patches were already applied, and I re-installed the lot
> (./configure -cuda gnu, make install, make test) -- no improvement. Again,
> it seems the GPU is cactus.
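>
> (The rebuild itself was just the standard sequence, with the paths below as
> placeholders for wherever your Amber and CUDA 5.0 installations live:
>
>    export AMBERHOME=/path/to/amber
>    export CUDA_HOME=/usr/local/cuda-5.0
>    cd $AMBERHOME
>    ./configure -cuda gnu
>    make install
>    make test
>
> so nothing exotic on the build side.)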
>
> At this stage, after a little digging, I found that I could control the
> GPU fan manually by first setting the "Coolbits" option to 5 under the GPU
> device in xorg.conf, and then using the slider bar under the [Debian]
> Administration -> nVidia menu. Under "auto", the fan idled at 40% and the
> card held a temperature of 33C with no load. When the temperature started
> hitting about 67C, the fan slowly ticked up to 44-45%, until the job aborted
> at a temperature around 71C. If I manually increased the fan speed to 80%
> before starting the job, the temperature seemed to hold at a steady state
> of about 63C, and the job ran for 45 minutes instead of ~90 seconds. It was
> *supposed* to run for ~3 hours, but I stopped babysitting it after 10
> minutes, so I don't know at what temperature it failed.
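>
> In case it helps anyone else, the relevant bits of the fan workaround were
> roughly the following; the nvidia-settings attribute names shift a little
> between driver versions, so check nvidia-settings -q all on your own box
> rather than trusting these exact strings:
>
>    # xorg.conf, inside the Device section for the card
>    Option "Coolbits" "5"
>
>    # then force the fan to ~80% from a terminal (or use the GUI slider)
>    nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
>                    -a "[fan:0]/GPUCurrentFanSpeed=80"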
>
> --------
> > From: Aron Broom <broomsday.gmail.com>
> > Date: Wednesday, 13 March 2013 7:39 AM
>
> > Also, if you check out the SimTK or OpenMM people's website, I believe
> > they have a GPU version of the popular memtest86 application that can
> > allow you to quickly or exhaustively check your GPU's memory.
>
> > I had found that I was often having AMBER jobs complete, but with all
> > positions as NaN after a few fs on a GTX 580, but not on a 570 or M2070,
> > and running that application showed the 580 had a number of bad memory
> > sectors.
>
> Aron, thanks so much for the lead on the stress test; this is exactly what
> I was looking for but couldn't find to save myself. I downloaded and
> compiled the memtestG80 code, and ran 1000 iterations at 256 MB of memory
> (~33%) and at 912 MB (~97%, I think). I left the fan on 40%, and the
> temperature sat around 72C. The non-zero records are copied below, but the
> upshot is that several errors popped up, only under the "Memtest86
> Modulo-20" and "Random blocks" tests. I don't know enough to interpret
> these in the context of the aborted Amber.cuda jobs, but I can appreciate
> that the card has failed.
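>
> For the record, the invocations were just the amount of memory (in MiB) and
> the iteration count on the command line; there is also a flag to select
> which card to test if you have more than one, so check ./memtestG80 --help
> on your build:
>
>    ./memtestG80 256 1000     # ~33% of the card's memory
>    ./memtestG80 912 1000     # ~97%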
>
> --------
>
> Seems I need to ./configure my budget and get the best replacement I can!
> Many thanks once again.
>
> Kind Regards,
> John
>
> Re the memtestG80:
>
> The full record for most iterations looks like this:
>
> Test iteration 1 (GPU 0, 256 MiB): 0 errors so far
> Moving Inversions (ones and zeros): 0 errors (17 ms)
> Memtest86 Walking 8-bit: 0 errors (136 ms)
> True Walking zeros (8-bit): 0 errors (69 ms)
> True Walking ones (8-bit): 0 errors (68 ms)
> Moving Inversions (random): 0 errors (17 ms)
> Memtest86 Walking zeros (32-bit): 0 errors (270 ms)
> Memtest86 Walking ones (32-bit): 0 errors (248 ms)
> Random blocks: 0 errors (109 ms)
> Memtest86 Modulo-20: 0 errors (721 ms)
> Logic (one iteration): 0 errors (15 ms)
> Logic (4 iterations): 0 errors (42 ms)
> Logic (shared memory, one iteration): 0 errors (24 ms)
> Logic (shared-memory, 4 iterations): 0 errors (75 ms)
>
> Here are the non-zero records from my output for the two runs (1000
> iterations each, the first using 256 MB, the second using 912 MB):
>
> ------
>
> Running 1000 iterations of tests over 256 MB of GPU memory on card 0:
> GeForce GTX 460
>
> Running memory bandwidth test over 20 iterations of 128 MB transfers...
> Estimated bandwidth 104489.80 MB/s
>
> Test iteration 1 (GPU 0, 256 MiB): 0 errors so far
> Test iteration 2 (GPU 0, 256 MiB): 0 errors so far
> Test iteration 286 (GPU 0, 256 MiB): 0 errors so far
> Memtest86 Modulo-20: 31 errors (710 ms)
>
> Test iteration 287 (GPU 0, 256 MiB): 31 errors so far
> Test iteration 479 (GPU 0, 256 MiB): 31 errors so far
> Memtest86 Modulo-20: 2 errors (718 ms)
>
> Test iteration 480 (GPU 0, 256 MiB): 33 errors so far
> Test iteration 609 (GPU 0, 256 MiB): 33 errors so far
> Memtest86 Modulo-20: 144 errors (713 ms)
>
> Test iteration 610 (GPU 0, 256 MiB): 177 errors so far
> Test iteration 937 (GPU 0, 256 MiB): 177 errors so far
> Memtest86 Modulo-20: 13 errors (721 ms)
>
> Test iteration 938 (GPU 0, 256 MiB): 190 errors so far
> Test iteration 985 (GPU 0, 256 MiB): 190 errors so far
> Random blocks: 3 errors (108 ms)
>
> Test iteration 986 (GPU 0, 256 MiB): 193 errors so far
>
> -------
> Running 1000 iterations of tests over 912 MB of GPU memory on card 0:
> GeForce GTX 460
>
> Running memory bandwidth test over 20 iterations of 456 MB transfers...
> Estimated bandwidth 103050.85 MB/s
>
> Test iteration 1 (GPU 0, 912 MiB): 0 errors so far
> Test iteration 2 (GPU 0, 912 MiB): 0 errors so far
> Test iteration 43 (GPU 0, 912 MiB): 0 errors so far
> Memtest86 Modulo-20: 1 errors (2412 ms)
>
> Test iteration 44 (GPU 0, 912 MiB): 1 errors so far
> Test iteration 95 (GPU 0, 912 MiB): 1 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 96 (GPU 0, 912 MiB): 2 errors so far
> Test iteration 134 (GPU 0, 912 MiB): 2 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 135 (GPU 0, 912 MiB): 3 errors so far
> Test iteration 146 (GPU 0, 912 MiB): 3 errors so far
> Memtest86 Modulo-20: 3 errors (2414 ms)
>
> Test iteration 147 (GPU 0, 912 MiB): 6 errors so far
> Test iteration 148 (GPU 0, 912 MiB): 6 errors so far
> Memtest86 Modulo-20: 6 errors (2412 ms)
>
> Test iteration 149 (GPU 0, 912 MiB): 12 errors so far
> Test iteration 394 (GPU 0, 912 MiB): 12 errors so far
> Random blocks: 2 errors (383 ms)
>
> Test iteration 395 (GPU 0, 912 MiB): 14 errors so far
> Test iteration 579 (GPU 0, 912 MiB): 14 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 580 (GPU 0, 912 MiB): 15 errors so far
> Test iteration 610 (GPU 0, 912 MiB): 15 errors so far
> Memtest86 Modulo-20: 3 errors (2413 ms)
>
> Test iteration 611 (GPU 0, 912 MiB): 18 errors so far
> Test iteration 624 (GPU 0, 912 MiB): 18 errors so far
> Random blocks: 1 errors (384 ms)
>
> Test iteration 625 (GPU 0, 912 MiB): 19 errors so far
> Test iteration 655 (GPU 0, 912 MiB): 19 errors so far
> Random blocks: 1 errors (384 ms)
>
> Test iteration 656 (GPU 0, 912 MiB): 20 errors so far
> Test iteration 792 (GPU 0, 912 MiB): 20 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 793 (GPU 0, 912 MiB): 21 errors so far
> Test iteration 826 (GPU 0, 912 MiB): 21 errors so far
> Memtest86 Modulo-20: 6 errors (2411 ms)
>
> Test iteration 827 (GPU 0, 912 MiB): 27 errors so far
> Test iteration 856 (GPU 0, 912 MiB): 27 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 857 (GPU 0, 912 MiB): 28 errors so far
> Test iteration 948 (GPU 0, 912 MiB): 28 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 949 (GPU 0, 912 MiB): 29 errors so far
> Test iteration 955 (GPU 0, 912 MiB): 29 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 956 (GPU 0, 912 MiB): 30 errors so far
> Test iteration 985 (GPU 0, 912 MiB): 30 errors so far
> Random blocks: 1 errors (383 ms)
>
> Test iteration 986 (GPU 0, 912 MiB): 31 errors so far
>
>
>
> ==== === == = = = = = = = = = =
> John Gehman Office +61 3 8344 2417
> ARC Future Fellow Fax +61 3 9347 8189
> School of Chemistry Magnets +61 3 8344 2470
> Bio21 Institute Mobile +61 407 536 585
> 30 Flemington Rd jgehman.unimelb.edu.au
> Univ. of Melbourne .GehmanLab
> VIC 3010 Australia
> http://www2.chemistry.unimelb.edu.au/staff/jgehman/research/
>
> "Science really suffers from bureaucracy. If we hadn't broken
> every single WHO rule many times over, we would never
> have defeated smallpox. Never."
> -- Isao Arita, final director of the WHO smallpox eradication program
>
> ==== === == = = = = = = = = = =
>
--
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Mar 13 2013 - 08:00:03 PDT