Many thanks, Jason, Hector, and Ross,
To answer Ross's questions:
-- No, I cannot find any error messages anywhere. I've checked the md.out files and /var/log, and monitored nvidia-smi; there is no evidence of any problems.
-- I've confirmed that the fan is running fine. However, I think it's probably correct that the fault is temperature related. I tested again this morning, at a slightly cooler ambient temperature than the tests I reported earlier, which were run later in the day (during a general heat wave here in Australia). The MD ran longer this time (2-3 minutes), but I think it failed at similar temperatures: the last temperatures I caught with manual nvidia-smi updates before failure were 67-70 C.
-- The card does not fall off the bus: no reboot is required, and from what I can find on the web, I believe there would be a log entry in /var/log if I had suffered such an event.
-- I'll probe a bit further into the results to look for "crazy". The CUDA tests reported no bona fide errors and 7 of 88 "possible failures", all of which were "Maximum * error …" messages for differences in the last digit of the specified values; all the tests ran to completion, though.
CUDA-Z runs fine for as long as I've let it go (longer than the two minutes after which AMBER fails), although the temperature doesn't reach the same level.
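For anyone wanting to catch the temperature right up to the point of failure rather than refreshing nvidia-smi by hand, a minimal Python sketch along the following lines might help. It assumes the installed nvidia-smi accepts the --query-gpu=temperature.gpu and --format=csv,noheader,nounits options (older drivers or some cards may not report this field). The names parse_temps, format_record and log_temps, and the log file name gpu_temps.log, are hypothetical and chosen only for illustration.

```python
import subprocess
import time
from datetime import datetime

# Assumption: this nvidia-smi query is supported by the installed driver.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(output):
    """Return one integer temperature (deg C) per output line; skip anything else."""
    temps = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():
            temps.append(int(line))
    return temps


def format_record(when, temps):
    """Build one log line: ISO timestamp followed by the temperatures."""
    stamp = when.isoformat(timespec="seconds")
    return stamp + " " + " ".join(str(t) for t in temps)


def log_temps(path, interval_s=1.0, max_samples=None):
    """Append a timestamped reading to `path` every `interval_s` seconds."""
    count = 0
    with open(path, "a") as log:
        while max_samples is None or count < max_samples:
            result = subprocess.run(QUERY_CMD, capture_output=True, text=True)
            temps = parse_temps(result.stdout)
            log.write(format_record(datetime.now(), temps) + "\n")
            log.flush()  # keep the last reading on disk if the job dies
            count += 1
            time.sleep(interval_s)


# Example (hypothetical file name), started just before launching the MD job:
# log_temps("gpu_temps.log")
```

Running log_temps in a second terminal while the MD job runs would leave the last readings before the failure in the log file, which could then be compared between the cooler morning runs and the hotter afternoon ones.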
Please do let me know if the above follow-up sheds any more light on the matter, but it sounds fairly likely that I've got a dodgy card and that buying a replacement is warranted. I take your point, Jason, that quality/reliability and performance may *both* scale with the model selected, even if Hector got lucky. Maybe I need to have another look down the back of the sofa before going shopping. Many thanks for your help!
Kind Regards,
John
==== === == = = = = = = = = = =
John Gehman Office +61 3 8344 2417
ARC Future Fellow Fax +61 3 9347 8189
School of Chemistry Magnets +61 3 8344 2470
Bio21 Institute Mobile +61 407 536 585
30 Flemington Rd jgehman.unimelb.edu.au
Univ. of Melbourne .GehmanLab
VIC 3010 Australia
http://www2.chemistry.unimelb.edu.au/staff/jgehman/research/
"Science really suffers from bureaucracy. If we hadn't broken
every single WHO rule many times over, we would never
have defeated smallpox. Never."
-- Isao Arita, final director of the WHO smallpox eradication program
==== === == = = = = = = = = = =
From: Ross Walker <ross.rosswalker.co.uk>
Reply-To: AMBER Mailing List <amber.ambermd.org>
Date: Tuesday, 12 March 2013 2:20 AM
To: AMBER Mailing List <amber.ambermd.org>
Subject: Re: [AMBER] GTX 460 ?
Hi John
The list on the AMBER website is far from exhaustive, mainly because I
can't keep up with all the various models of GPU that NVIDIA releases. The
GTX460 and 465 should both work fine with AMBER, although I've not tested
them. The fact that the code runs some MD indicates that it should work.
What you are seeing is indicative of a faulty GPU. Are there no error
messages reported anywhere? Does it always fail at the same point or
just roughly the same point? Does the GPU drop off the bus completely
(requiring a reboot to see it again)? Typically, when a job runs for a
few minutes and then stops, it implies an overheating GPU, maybe a fan not
working properly, for example. It could also mean dodgy memory on the GPU,
which happens sometimes, although in that case the results are normally
crazy before the crash.
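As a rough way to look for "crazy" results before the crash, something like the following Python sketch could scan the energy records of an mdout file. It assumes each energy record contains a field of the form "Etot = <value>", and that an overflowed field shows up as asterisks. The 1000.0 jump threshold is an arbitrary placeholder to tune per system, and the names read_etot and find_suspect are hypothetical. Any summary blocks in the file that reuse the same layout may also match and would need trimming first.

```python
import math
import re

# Assumption: energy records in the mdout file contain "Etot   =   <value>".
ETOT_RE = re.compile(r"Etot\s*=\s*(\S+)")


def read_etot(lines):
    """Extract Etot values; fields that are not numbers (e.g. '*****') become NaN."""
    values = []
    for line in lines:
        match = ETOT_RE.search(line)
        if match:
            try:
                values.append(float(match.group(1)))
            except ValueError:
                values.append(math.nan)
    return values


def find_suspect(values, max_jump=1000.0):
    """Return indices where Etot is non-finite or jumps by more than max_jump."""
    suspect = []
    for i, value in enumerate(values):
        if not math.isfinite(value):
            suspect.append(i)
        elif (
            i > 0
            and math.isfinite(values[i - 1])
            and abs(value - values[i - 1]) > max_jump
        ):
            suspect.append(i)
    return suspect


# Example (hypothetical file name):
# with open("md.out") as handle:
#     print(find_suspect(read_etot(handle)))
```

An empty list from find_suspect would suggest the energies looked sane right up to the point where the job stopped, which points more towards overheating than towards bad memory.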
Do the test cases all pass?
As for the GTX560 - yes that should work fine.
All the best
Ross
On 3/10/13 10:54 PM, "John Gehman" <jgehman.unimelb.edu.au> wrote:
Dear Amber Fans,
Could anybody confirm whether or not the nVidia GTX 460 chipset should
work with Amber12? It's not on the list at
http://ambermd.org/gpus/#supported_gpus, which I presume is drawing a
distinction between hardware revision/compute capability 2.1 vs 2.0 [e.g.
for the GTX 465] per the guideline on that page. However, v2.1 *does*
appear to provide double precision, and the GTX560 which *is* OK'd for
Amber12 appears to actually be v2.1 as well (ref
https://developer.nvidia.com/cuda-gpus).
The problem is that my Amber12 jobs seem to die with no errors or
explanation after about 30 ps on my GPU. This has happened with one of my
own runs, as well as one of the benchmark runs, both of which run fine
(albeit slowly, of course) on a single CPU.
I am trying to ascertain whether the GPU is to blame, and if so, whether
a GTX560 (Ti) will actually get me going, or not.
Many Thanks!
John Gehman
University of Melbourne
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Mar 11 2013 - 19:00:03 PDT