Re: [AMBER] cuda launch time out error in Amber 11

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 27 Aug 2010 05:57:45 -0700

Hi Sergio,

> I am using a GTX470 card under Linux RH 4.8. I have installed and
> tested both the Cuda SDK and Amber 11.
> The deviceQuery program produces the expected output. This card
> supports Cuda 2.0, just like the GTX480. There is no Windows OS on the
> hard drive, only Linux. The host machine is an AMD Phenom II Quad
> processor with 8 GB ram, and a 650 W power supply. The machine has
> only one video card, but all my runs are done via remote access without
> logging in to the console via x-windows. If I set the machine to
> default run level 3 in the inittab file to prevent X-windows from
> starting, then the deviceQuery program fails. The driver apparently
> needs run level 5.

Is the kernel module running?

Does lsmod show the nvidia driver loaded? My guess would be that having
runlevel 3 means the driver does not get loaded. Try 'modprobe nvidia' and
then try running the deviceQuery program again.
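
A minimal sketch of the same check in script form, assuming a Linux host where
/proc/modules lists one loaded module per line with the module name as the
first field. The function names are made up for illustration.

```python
def module_loaded(name, proc_modules_text):
    """Return True if a kernel module called `name` is listed.

    `proc_modules_text` is the content of /proc/modules; each line starts
    with the module name followed by whitespace.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Compare the whole first field so "nvidia_uvm" does not match "nvidia".
        if fields and fields[0] == name:
            return True
    return False


def nvidia_loaded(path="/proc/modules"):
    """Read the live module list and look for the nvidia module."""
    with open(path) as fh:
        return module_loaded("nvidia", fh.read())
```

If nvidia_loaded() returns False at runlevel 3, 'modprobe nvidia' (as root) is
the next thing to try before rerunning deviceQuery.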
 
> I have successfully run about four 20 ns trajectories on a 6 kDa
> protein with explicit TIP3P water for a total of 25,000 atoms, water
> included. However, when I attempt to run a larger protein with a total
> of 63,000 or 68,000 atoms (water included), I get the "Error: the
> launch timed out and was terminated launching kernel
> kPMEGetGridWeights" that has been reported here by Wookyung. This
> error occurs whether I run an NVT or an NPT ensemble, anywhere from 0.1
> ns to 1 ns of the run. I have not seen the error during the preliminary
> energy minimization steps before MD.

My initial guess would be that you are running out of memory on the card,
except that the details below suggest otherwise. 1.5GB should be enough for
around 90K atoms or so, but there may be subtle issues with your system. Note
the upcoming patch for parallel GPU support will also improve the memory usage
to allow 408K on a C2050 etc.
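
As a rough sanity check on those numbers, here is a sketch that scales the
"about 90K atoms in 1.5GB" figure to another card size. It assumes memory use
grows linearly with atom count, which is purely an assumption of this sketch
and not a statement about how pmemd.cuda actually allocates memory. The
function name is hypothetical.

```python
def atoms_that_fit(card_mem_gb, ref_mem_gb=1.5, ref_atoms=90_000):
    """Crude proportional estimate of the largest system for a card.

    ASSUMPTION: memory scales linearly with atom count. Real usage also
    depends on cutoff, PME grid size and other settings.
    """
    return round(ref_atoms * card_mem_gb / ref_mem_gb)
```

Under that assumption, the 1.3 GB quoted for the GTX 470 comes out near 78K
atoms, above the 63K to 68K systems that fail, which is in line with memory
not being the obvious culprit.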

What cutoff are you using? Also, is your density good? Is there anything
that might be non-standard?

The fact it happens during the run suggests it might be related to some
change in density etc.
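
For a quick density check, a minimal sketch that converts total mass and box
size into g/cm^3. It assumes a rectangular (orthorhombic) box with edge
lengths in Angstroms and a total mass in atomic mass units. The function name
is made up for illustration.

```python
AMU_TO_GRAMS = 1.66053907e-24  # grams per atomic mass unit
A3_TO_CM3 = 1.0e-24            # cm^3 per cubic Angstrom


def density_g_per_cm3(total_mass_amu, box_a, box_b, box_c):
    """Density of a rectangular box; edge lengths are in Angstroms."""
    volume_a3 = box_a * box_b * box_c
    return (total_mass_amu * AMU_TO_GRAMS) / (volume_a3 * A3_TO_CM3)
```

A solvated protein in TIP3P water should come out close to 1 g/cm^3. A value
that drifts well away from that during the run would point to a problem with
the system rather than the card.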

It could of course be a 'bug'.

> If pmemd.cuda is allocating more memory than the card has (1.3 GB for
> the GTX 470, vs about 1.5 GB for the GTX480), then I would expect the
> run to terminate consistently at the same point every time, but the
> program terminates at seemingly random times. Restarting the run
> yields the same behavior - about another ns can be accumulated before
> the error creeps up again.

This sounds like a dodgy card to me. Does it happen with smaller systems?

> I plan to increase the number of atoms from 25,000 upwards to see at
> what size the error begins to generate an idea of what size problems
> can be done on this card. However, I would like to understand what is
> going on. Thanks for any feedback.

I would try swapping out the card if you can. Or check the heatsink and
make sure the fan is running. Take a look at the nvidia-settings tool as
well while the job is running and see if the temperature is spiking. Try
running it with the case off as well, with a fan pointing at the graphics
card. This will tell you whether heat is the cause or not.
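
A minimal sketch for logging the temperature while a job runs, so a spike can
be lined up with the time of the crash. It assumes that
'nvidia-settings -q GPUCoreTemp -t' prints the core temperature in Celsius on
this driver version, and that nvidia-settings can reach a running X display.
The function names and the 90 C limit in the usage note are hypothetical.

```python
import subprocess
import time


def read_gpu_temp_c():
    """Query the GPU core temperature via nvidia-settings (assumed command)."""
    out = subprocess.check_output(
        ["nvidia-settings", "-q", "GPUCoreTemp", "-t"], text=True)
    # Take the first value printed; more than one target may be reported.
    return int(out.split()[0])


def log_temps(n_samples, interval_s, reader=read_gpu_temp_c):
    """Collect (timestamp, temperature) pairs at a fixed interval."""
    samples = []
    for _ in range(n_samples):
        samples.append((time.time(), reader()))
        time.sleep(interval_s)
    return samples


def find_spikes(samples, limit_c):
    """Return the samples whose temperature is at or above limit_c."""
    return [(t, temp) for t, temp in samples if temp >= limit_c]
```

For example, find_spikes(log_temps(360, 10), 90) would sample every 10 seconds
for an hour and report any readings of 90 C or more. Pick a limit that suits
your card.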

Also, note a 650W power supply is probably a little wimpy for such a setup.

All the best
Ross


/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.




_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Aug 27 2010 - 06:00:03 PDT