Re: [AMBER] NaN Error in mdout, restrt, mdcrd, mdvel files with pmemd.cuda

From: Ismail, Mohd F. <farid.ou.edu>
Date: Sun, 28 Aug 2011 19:38:39 +0000

Thank you, Ross. I understand this is one of the most difficult kinds of problem to debug. Some other GPU-based programs face similar problems with weird GPU behavior.

To clarify, I didn't swap the GPU with another one; I just used the other GPU on the GTX590 via CUDA_VISIBLE_DEVICES. Either way, both GPUs were functioning properly again after a reboot.
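
For readers who have not used it, CUDA_VISIBLE_DEVICES is an environment variable that restricts which GPUs a CUDA program can see. The following is a minimal sketch, not part of the original message, of selecting one GPU from a Python wrapper; the helper names gpu_env and run_on_gpu are hypothetical, and the command passed in is whatever you would normally run.

```python
import os
import subprocess


def gpu_env(device_index, base_env=None):
    """Return a copy of the environment that exposes only one CUDA device.

    gpu_env is a hypothetical helper name; only the CUDA_VISIBLE_DEVICES
    variable itself comes from the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_on_gpu(command, device_index):
    """Run `command` (a list of arguments) with only the chosen GPU visible."""
    return subprocess.run(command, env=gpu_env(device_index), check=False)


# Example call, not executed here:
#   run_on_gpu(["make", "test.cuda"], 1)
```

Inside the launched process the single visible device is renumbered as device 0, so the program itself needs no changes.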

Right now I won't try the 4.0 toolkit because of a compatibility issue with gcc 4.5; maybe I will in the future. I am actually using a Supermicro motherboard, so maybe that is part of the problem. But I never had a problem before with the benchmark files. Even the problem with my solvent system went away with a reboot. Since I'm not running a set of nodes, just a simple tower, I think I can live with a reboot.

I am also not running X windows on the GPUs. They're compute only. The make test.cuda and test.cuda.parallel runs ended normally (no comparison failures, just errors that are not related to the results).

Thank you for the input.

--Farid
________________________________________
From: Ross Walker [rosscwalker.gmail.com]
Sent: Saturday, August 27, 2011 6:48 PM
To: AMBER Mailing List
Subject: Re: [AMBER] NaN Error in mdout, restrt, mdcrd, mdvel files with pmemd.cuda

Hi Farid

This is one of those things that is very difficult to debug, but if changing the GPU works it is almost certainly hardware or some kind of driver bug. You could try switching to CUDA 4.0 and the very latest development driver.

Note I have seen a number of incompatibilities with cheap motherboards that claim to support 2 GPUs but do not. Take my own desktop, for example, when I run the test cases: the GTX580 with ID 0 gives NaNs for the JAC and DHFR test cases but GPU ID 1 works. Swapping the GPUs over shows it is not GPU related, since GPU ID 0 still fails. Another machine with an identical motherboard also shows the exact same problem, but other machines I have with different makes work fine. Strangely it is only Supermicro boards that I have ever had problems with, and these are supposed to be the good ones. :-( Stick with Asus for gaming-type machines; they appear to be much more reliable.

If you are running X windows as well, this is most likely part of your problem. Do NOT run X windows while running pmemd.cuda, or if you really must, go get yourself a cheap $50 card just to run the X display.

Something else to try: run the complete 'make test.cuda' test suite on GPU 0 and then repeat on GPU 1 and see if they both pass perfectly. If one fails then you may have a dodgy GPU. Physically swap the GPUs over and try again. If the errors 'follow' the card then it is the GPU at fault. If the error remains on the same GPU ID (even though you swapped the hardware) then you have a motherboard incompatibility or something related to your running X windows on that GPU.
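
The swap test above reduces to a small decision rule. Here is a minimal sketch of that rule, assuming two physically separate cards and that you record which GPU ID failed before and after the swap; the function name diagnose_swap and the wording of its return strings are illustrative only.

```python
def diagnose_swap(failing_id_before, failing_id_after):
    """Interpret the physical swap test for a two-GPU machine.

    Each argument is the GPU ID (0 or 1) whose test run failed, or None
    if no run failed. diagnose_swap is a hypothetical name for this sketch.
    """
    if failing_id_before is None:
        return "no failure observed before the swap"
    if failing_id_after is None:
        return "failure did not reproduce after the swap: inconclusive"
    if failing_id_after != failing_id_before:
        # The fault moved to the other slot together with the card.
        return "error followed the card: suspect the GPU"
    # The same ID fails even though a different card now sits there.
    return "error stayed on the same GPU ID: suspect the motherboard or X on that GPU"
```

For example, diagnose_swap(0, 0) matches the desktop case described earlier in this message, where GPU ID 0 kept failing after the cards were exchanged.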

A BIOS update and/or driver update 'may' help.

All the best
Ross



On Aug 28, 2011, at 4:13, "Ismail, Mohd F." <farid.ou.edu> wrote:

> I just ran into the previously discussed (see http://archive.ambermd.org/201101/0301.html and http://archive.ambermd.org/201101/0391.html ) NaN error. My system is a dual Opteron with a GTX 590, with all bugfixes through 17 applied for Amber11, and bugfixes 13, 14, 15 (bugfixes all) for AmberTools1.5.
>
> I'm using gcc 4.5, and cuda toolkit 3.2, on OpenSUSE 11.4 x64.
>
> The system is organic solvent with ~40,000 atoms. Using ntp=1, ntb=2, ntt=1. Running with pmemd.MPI eliminates the error. Changing the GPU also eliminates the error (CUDA_VISIBLE_DEVICES).
>
> I suspect this is a memory error on the particular GPU, and sure enough, rebooting the system solved the problem. My question is: is it really a memory error, or is it some known issue with the GTX5XX?
>
> --Farid
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Sun Aug 28 2011 - 13:00:02 PDT