Re: [AMBER] NaN Error in mdout, restrt, mdcrd, mdvel files with pmemd.cuda

From: Ross Walker <rosscwalker.gmail.com>
Date: Sun, 28 Aug 2011 07:48:31 +0800

Hi Farid

This is one of those things that is very difficult to debug, but if changing the GPU works it is almost certainly a hardware problem or some kind of driver bug. You could try switching to CUDA 4.0 and the very latest development driver.

Note that I have seen a number of incompatibilities with cheap motherboards that claim to support 2 GPUs but do not. Take my own desktop, for example: when I run the test cases, the GTX580 with ID 0 gives NaNs for the JAC and DHFR test cases but GPU ID 1 works. Swapping the GPUs over shows it is not GPU related, since GPU ID 0 still fails. Another machine with an identical motherboard also shows the exact same problem, but other machines I have with different makes work fine. Strangely, it is only Supermicro boards that I have ever had problems with, and these are supposed to be the good ones. :-( Stick with Asus for gaming-type machines; they appear to be much more reliable.

If you are running X windows as well, this is most likely part of your problem. Do NOT run X windows while running pmemd.cuda or, if you really must, go get yourself a cheap $50 card just to run the X display.

Something else to try: run the complete 'make test.cuda' test suite on GPU 0, then repeat on GPU 1, and see if they both pass perfectly. If one fails then you may have a dodgy GPU. Physically swap the GPUs over and try again. If the errors 'follow' the card then it is the GPU at fault. If the error remains on the same GPU ID (even though you swapped the hardware) then you have a motherboard incompatibility or something related to you running X windows on that GPU.
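
The swap test above can be sketched roughly as follows. This is only an illustration, not part of Amber: the function names and the test_dir argument are hypothetical, and it assumes that 'make test.cuda' returns a nonzero exit status when something fails, which may not hold for your install, so check the test output and any diff files by hand as well. Each GPU is selected by setting CUDA_VISIBLE_DEVICES.

```python
import os
import subprocess


def run_suite(gpu_id, test_dir):
    """Run 'make test.cuda' with only one GPU visible.

    test_dir is wherever you normally run the test suite from (hypothetical
    parameter). Returns True if make exits with status 0. ASSUMPTION: the exit
    status reflects test failures; inspect the logs too.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    result = subprocess.run(["make", "test.cuda"], cwd=test_dir, env=env)
    return result.returncode == 0


def diagnose(before, after):
    """Interpret pass/fail results per GPU ID, before and after swapping cards.

    before, after: dicts mapping GPU ID -> True if the suite passed.
    """
    failed_before = {gpu for gpu, ok in before.items() if not ok}
    failed_after = {gpu for gpu, ok in after.items() if not ok}
    if not failed_before and not failed_after:
        return "none"          # everything passed both times
    if failed_before and failed_before == failed_after:
        return "slot"          # error stays on the same GPU ID: motherboard or X
    if failed_before and failed_after and failed_before.isdisjoint(failed_after):
        return "card"          # error followed the card: suspect the GPU itself
    return "inconclusive"      # intermittent or mixed results; repeat the runs


# Example workflow (not executed here):
#   before = {gpu: run_suite(gpu, "/path/to/tests") for gpu in (0, 1)}
#   ... power down, physically swap the cards, reboot ...
#   after = {gpu: run_suite(gpu, "/path/to/tests") for gpu in (0, 1)}
#   print(diagnose(before, after))
```

A result of "slot" corresponds to the motherboard or X windows case, and "card" to a faulty GPU.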

A BIOS update and/or driver update 'may' help.

All the best
Ross



On Aug 28, 2011, at 4:13, "Ismail, Mohd F." <farid.ou.edu> wrote:

> I just ran into the previously discussed NaN error (see http://archive.ambermd.org/201101/0301.html and http://archive.ambermd.org/201101/0391.html ). My system is a dual Opteron with a GTX 590, with all 17 bugfixes applied for Amber11, and bugfixes 13, 14, 15 (bugfixes all) for AmberTools1.5.
>
> I'm using gcc 4.5, and cuda toolkit 3.2, on OpenSUSE 11.4 x64.
>
> The system is an organic solvent with ~40,000 atoms, using ntp=1, ntb=2, ntt=1. Running with pmemd.MPI eliminates the error. Changing the GPU (via CUDA_VISIBLE_DEVICES) also eliminates the error.
>
> I suspect this is a memory error on that particular GPU, and sure enough, rebooting the system solved the problem. My question is: is it really a memory error, or is it some known issue with the GTX 5XX?
>
> --Farid
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Sat Aug 27 2011 - 17:00:03 PDT