Run JAC NVE for 100,000 iterations. If it crashes, you have a bad GPU.
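
Since the failures further down the thread look temperature related, it may help to log the GPU temperature while the benchmark runs, rather than catching it by hand with nvidia-smi. What follows is only a rough sketch, not a tested diagnostic. It assumes a driver whose nvidia-smi supports the --query-gpu option and reports a plain integer for temperature.gpu. The function names, the log file name, and the polling interval are all made up for illustration.

```python
import subprocess
import time

# nvidia-smi query for the temperature of every GPU, one bare integer per line.
# Assumption: the installed driver supports --query-gpu.
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]


def parse_temperatures(output):
    """Turn nvidia-smi output (one integer per line) into a list of ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_temperatures():
    """Ask nvidia-smi for the current temperature of each GPU, in degrees C."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)


def watch(command, logfile="gpu_temps.log", interval=2.0):
    """Run `command` and append a timestamped temperature line while it is alive.

    Returns the exit code of the command. The last line in the log is then
    the temperature seen shortly before the job ended.
    """
    job = subprocess.Popen(command)
    with open(logfile, "a") as log:
        while job.poll() is None:
            temps = read_temperatures()
            log.write(f"{time.strftime('%H:%M:%S')} {temps}\n")
            log.flush()  # keep the log current even if the machine hangs
            time.sleep(interval)
    return job.returncode
```

You would call it along the lines of watch(["pmemd.cuda", "-O", "-i", "mdin"]), with whatever input files the benchmark directory actually uses. If the last logged value is always around the same temperature when the job dies, that points at the card.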
On Mar 11, 2013 6:59 PM, "John Gehman" <jgehman.unimelb.edu.au> wrote:
> Many Thanks Jason, Hector, and Ross,
>
> To answer Ross's questions:
> -- No, I cannot find any error messages anywhere — I've checked md.out
> files, /var/log files, monitored nvidia-smi, no evidence of any problems.
> -- I've confirmed that the fan is running fine, however I think it's
> probably correct that the fault is temperature related: I tested again this
> morning, at slightly cooler ambient temperature than the tests I reported
> earlier which were run later in the day (during a general heat wave here in
> Australia) -- the md ran longer this time (2-3 minutes), but I think failed
> at similar temperatures (last caught temps with manual nvidia-smi updates
> before failure were 67-70C)
> -- The card does not fall off the bus — no reboot required, and from what
> I can find on the web, I believe there should be a log entry in /var/log if
> I were to suffer such an event.
> -- I'll probe a bit further into the results to look for "crazy". The cuda
> tests reported no bona fide errors, and 7/88 "possible failures", all of
> which were "Maximum * error …" messages for differences in the last digit
> of specified values; all the tests fundamentally ran and completed, though.
>
> CUDA-Z runs fine for as long as I've let it run (longer than the two
> minutes it takes AMBER to fail), although the temperature doesn't reach
> the same level.
>
> Certainly please let me know if the above follow-up sheds any more light
> on the matter, but it all sounds fairly likely that I've got a dodgy card,
> and buying a replacement is warranted. I take your point, Jason, that
> quality/reliability and performance may *both* scale with the model
> selected, even if Hector got lucky. Maybe I need to have another look down
> the back of the sofa before going shopping. Many thanks for your help!
>
> Kind Regards,
> John
>
> ==== === == = = = = = = = = = =
> John Gehman Office +61 3 8344 2417
> ARC Future Fellow Fax +61 3 9347 8189
> School of Chemistry Magnets +61 3 8344 2470
> Bio21 Institute Mobile +61 407 536 585
> 30 Flemington Rd jgehman.unimelb.edu.au
> Univ. of Melbourne .GehmanLab
> VIC 3010 Australia
> http://www2.chemistry.unimelb.edu.au/staff/jgehman/research/
>
> "Science really suffers from bureaucracy. If we hadn't broken
> every single WHO rule many times over, we would never
> have defeated smallpox. Never."
> -- Isao Arita, final director of the WHO smallpox eradication program
>
> ==== === == = = = = = = = = = =
>
> From: Ross Walker <ross.rosswalker.co.uk>
> Reply-To: AMBER Mailing List <amber.ambermd.org>
> Date: Tuesday, 12 March 2013 2:20 AM
> To: AMBER Mailing List <amber.ambermd.org>
> Subject: Re: [AMBER] GTX 460 ?
>
> Hi John
>
> The list on the amber website is far from exhaustive, mainly because I
> can't keep up with all the various models of GPU that NVIDIA releases. The
> GTX460 and 465 should both work fine with AMBER, although I've not tested
> them. The fact that the code runs some MD indicates that it should work.
> What you are seeing is indicative of a faulty GPU. Are there no error
> messages reported anywhere? Does it always fail at the same point or
> just roughly the same point? Does the GPU drop off the bus completely
> (requiring a reboot to see it again)? Typically, when a job runs for a
> few minutes and then stops, it implies an overheating GPU, maybe a fan not
> working properly, for example. It could also mean dodgy memory on the GPU,
> which happens sometimes, although in that case the results are normally
> crazy before the crash.
>
> Do the test cases all pass?
>
> As for the GTX560 - yes that should work fine.
>
> All the best
> Ross
>
>
> On 3/10/13 10:54 PM, "John Gehman" <jgehman.unimelb.edu.au> wrote:
>
> Dear Amber Fans,
>
> Could anybody confirm whether or not the nVidia GTX 460 chipset should
> work with Amber12? It's not on the list at
> http://ambermd.org/gpus/#supported_gpus, which I presume is drawing a
> distinction between hardware revision/compute capability 2.1 vs 2.0 [e.g.
> for the GTX 465] per the guideline on that page. However, v2.1 *does*
> appear to provide double precision, and the GTX560 which *is* OK'd for
> Amber12 appears to actually be v2.1 as well (ref
> https://developer.nvidia.com/cuda-gpus).
>
> The problem is that my Amber12 jobs seem to die with no errors or
> explanation after about 30 ps on my GPU. This has happened for one of my
> runs, as well as one of the benchmark runs, both of which run fine (albeit
> slowly, of course) on a single CPU.
>
> I am trying to ascertain whether the GPU is to blame, and if so, whether
> a GTX560 (Ti) will actually get me going, or not.
>
> Many Thanks!
> John Gehman
> University of Melbourne
>
> ==== === == = = = = = = = = = =
> John Gehman Office +61 3 8344 2417
> ARC Future Fellow Fax +61 3 9347 8189
> School of Chemistry Magnets +61 3 8344 2470
> Bio21 Institute Mobile +61 407 536 585
> 30 Flemington Rd jgehman.unimelb.edu.au
> Univ. of Melbourne .GehmanLab
> VIC 3010 Australia
> http://www2.chemistry.unimelb.edu.au/staff/jgehman/research/
>
> "Crooked nails hold better" (JDG, unpublished data)
>
> ==== === == = = = = = = = = = =
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Mar 12 2013 - 11:00:02 PDT