Re: [AMBER] cuda launch time out error in Amber 11

From: Scott Le Grand <SLeGrand.nvidia.com>
Date: Fri, 27 Aug 2010 21:29:24 -0700

I once gave myself a major burn handling a GTX480 card that was running AMBER. A few weeks later that card's temperature sensor died, and the card itself followed shortly thereafter. Overheating here would not shock me. I doubt it's running out of memory. What you're seeing is the simulation flying off to Neptune, and the kernel launch failure is simply where that first manifests. A bogus math result anywhere is enough to do this within 5-10 iterations or so...

If it's not happening on Tesla, and it is happening on the 470, which has the same SM count as a C2050, I really don't think it's a bug. It's possible there's some bizarro OS issue in our driver that's somehow causing a mislaunch, but it seems awfully unlikely right now as that would *probably* hit the Tesla too.

Can you downclock to 1.15 GHz?



-----Original Message-----
From: Sergio R Aragon [mailto:aragons.sfsu.edu]
Sent: Friday, August 27, 2010 11:07
To: AMBER Mailing List
Cc: Anton Guliaev
Subject: Re: [AMBER] cuda launch time out error in Amber 11

Hi Ross, Sasha, Scott,

If I run /sbin/init 3 after booting at run level 5, then the driver does get loaded, so there isn't a problem with running under level 3, as Sasha also pointed out. I would think that even booting under level 5 but not logging in at the console should never cause the X server to get involved if I only use a terminal window to access the machine, as I have been doing. But I'll do my further tests under run level 3 from now on.
 BTW, the deviceQuery output for the GTX470 card is at the end of this message for reference.

As for how I do my runs, I am using an electrostatic cutoff of 9 Å, with a 10 Å solvent box buffer. I run 1 ns of NPT MD after energy minimization in order to equilibrate the density to values very near one. Then I switch to NVT for production to take advantage of the higher processing speed under that ensemble. I have found the launch error, always in that kPMEGetGridWeights kernel, only once I commence production. Staying in the NPT ensemble always yields the error, usually within 1 ns of production, for systems larger than 60,000 atoms on this card. (I have not yet tested atom counts between 25,000 and 60,000.)

It's good news that turning off ECC on a Tesla card does not produce the error. Of the environmental factors to consider, the most important ones are temperature and the power supply. Both of these factors will yield different performance in different cards, and will also show a system-size dependence (number of atoms). I am thinking that when a larger number of atoms is used, more memory is allocated and more work is done per step, drawing more current in the GPU and raising its temperature. I will verify this by monitoring the temperature during a 25,000-atom run compared to a 68,000-atom run.
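
For the temperature comparison, here is a minimal sketch of how the core
temperature could be polled from a small standalone program, assuming the
NVML library (nvml.h, linked with -lnvidia-ml) is available with the
installed driver; this is just the programmatic equivalent of what
nvidia-smi/nvidia-settings report, not anything AMBER provides:

  /* poll_temp.c - minimal GPU temperature poll via NVML.
   * Illustration only; assumes nvml.h is available and links with
   * -lnvidia-ml. */
  #include <stdio.h>
  #include <unistd.h>
  #include <nvml.h>

  int main(void)
  {
      if (nvmlInit() != NVML_SUCCESS) {
          fprintf(stderr, "could not initialize NVML\n");
          return 1;
      }
      nvmlDevice_t dev;
      if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
          fprintf(stderr, "could not get a handle for GPU 0\n");
          nvmlShutdown();
          return 1;
      }
      /* Sample the core temperature once per second for ten minutes. */
      for (int i = 0; i < 600; i++) {
          unsigned int temp_c = 0;
          if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU,
                                       &temp_c) == NVML_SUCCESS)
              printf("%4d s  GPU 0 temperature: %u C\n", i, temp_c);
          fflush(stdout);
          sleep(1);
      }
      nvmlShutdown();
      return 0;
  }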

What is the expected error message when the GPU memory is insufficient? If a GTX480 card can handle 90,000 atoms, then scaling by memory size (roughly 1.3 GB vs. 1.5 GB) the GTX470 should be able to handle about 75,000 atoms. It doesn't look like I'm pushing the memory limit. How can one determine the actual amount of allocated GPU memory during a run?
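
One way to check this from outside the run, as a minimal sketch using the
CUDA runtime: cudaMemGetInfo() reports the free and total device memory
across all contexts on the card, so a small program run alongside the
simulation shows how much is actually in use (the reading includes this
program's own small context as well):

  /* memcheck.cu - report device-wide in-use/free memory via the CUDA
   * runtime (compile with nvcc). Illustration only. */
  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      size_t free_b = 0, total_b = 0;
      cudaSetDevice(0);
      if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
          fprintf(stderr, "cudaMemGetInfo failed\n");
          return 1;
      }
      printf("GPU 0: %lu bytes in use of %lu total (%lu free)\n",
             (unsigned long)(total_b - free_b),
             (unsigned long)total_b, (unsigned long)free_b);
      return 0;
  }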

My card is an MSI N470GTX and is not overclocked as far as I can tell; the deviceQuery output below should confirm this (clock rate: 1.22 GHz). I also do not overclock the quad-core processor.

Increasing the capacity of the power supply sounds like a good idea. I'm hesitant about RMA'ing the present card because if this hardware can be flaky, the replacement card could be just as bad (look at Sasha's card - an error every few ns on a GTX480 with only 20,000 atoms). At least I can run smaller systems with no problem. I'll try other things first - the temperature measurements should be very indicative.

Sasha's latest comment noting that only the Fermi-based consumer cards are giving this error is very interesting. His comments on the power supply issue are also illuminating - it doesn't look like power is the source of the problem. This is starting to look like a hardware issue on the Fermi consumer cards (GTX 480/470/460). At SFSU a GTX240 has also not seen this error. Sasha has 8 GTX480 cards from Asus and all appear to behave the same, even when used one at a time. This strengthens the case for a hardware issue on Fermi consumer cards. The attraction of the 400-series GTX cards is the larger memory compared to the 200-series cards; we want to do larger MD simulations. Can Nvidia pay some attention to this problem?
 
Scott Le Grand - care to chime in at this point in the conversation?

Thank you all, Cheers, Sergio


CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce GTX 470"
  CUDA Driver Version: 3.10
  CUDA Runtime Version: 3.10
  CUDA Capability Major revision number: 2
  CUDA Capability Minor revision number: 0
  Total amount of global memory: 1341325312 bytes
  Number of multiprocessors: 14
  Number of cores: 448
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per block: 1024
  Maximum sizes of each dimension of a block: 1024 x 1024 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Clock rate: 1.22 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution: Yes
  Device has ECC support enabled: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 1, Device = GeForce GTX 470


PASSED

-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Friday, August 27, 2010 5:58 AM
To: 'AMBER Mailing List'
Subject: Re: [AMBER] cuda launch time out error in Amber 11

Hi Sergio,

> I am using a GTX470 card under Linux RH 4.8. I have installed and
> tested both the Cuda SDK and Amber 11.
> The deviceQuery program produces the expected output. This card
> supports Cuda 2.0, just like the GTX480. There is no Windows OS on the
> hard drive, only Linux. The host machine is an AMD Phenom II Quad
> processor with 8 GB ram, and a 650 W power supply. The machine has
> only one video card, but all my runs are done via remote access without
> logging in to the console via x-windows. If I set the machine to
> default run level 3 in the inittab file to prevent X-windows from
> starting, then the deviceQuery program fails. The driver apparently
> needs run level 5.

Is the kernel module running?

Does lsmod show the nvidia driver loaded? My guess would be that booting at
runlevel 3 means the driver does not get loaded. Try 'modprobe nvidia' and
then try running the deviceQuery command again.
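
In case it helps narrow things down, the first thing deviceQuery effectively
does is ask the runtime how many devices it can see, so a stripped-down
check along these lines (a sketch only, compiled with nvcc) will fail with a
descriptive error when the nvidia kernel module is not loaded:

  /* devcount.cu - minimal "is the driver/device visible?" check.
   * Roughly the first step of deviceQuery; illustration only. */
  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      int n = 0;
      cudaError_t err = cudaGetDeviceCount(&n);
      if (err != cudaSuccess) {
          printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
          printf("(typically means the nvidia kernel module is not loaded)\n");
          return 1;
      }
      printf("%d CUDA device(s) visible\n", n);
      return 0;
  }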
 
> I have successfully run about four 20 ns trajectories on a 6 kDa
> protein with explicit TIP3P water for a total of 25,000 atoms, water
> included. However, when I attempt to run a larger protein with a total
> of 63,000 or 68,000 atoms (water included), I get the "Error: the
> launch timed out and was terminated launching kernel
> kPMEGetGridWeights" that has been reported here by Wookyung. This
> error occurs whether I run an NVT or an NPT ensemble, anywhere from 0.1
> ns to 1 ns of the run. I have not seen the error during the preliminary
> energy minimization steps before MD.

My initial guess would be that you are running out of memory on the card,
except that the details below suggest otherwise. 1.5 GB should be enough for
around 90K atoms or so, but there may be subtle issues with your system. Note
that the upcoming patch for parallel GPU support will also improve the memory
usage, allowing around 408K atoms on a C2050 etc.

What cutoff are you using? Also, is your density good, etc.? Is there
anything that might be non-standard?

The fact that it happens during the run suggests it might be related to some
change in density etc.

It could, of course, be a 'bug'.

> If pmemd.cuda is allocating more memory than the card has (1.3 GB for
> the GTX 470, vs about 1.5 GB for the GTX480), then I would expect the
> run to terminate consistently at the same point every time, but the
> program terminates at seemingly random times. Restarting the run
> yields the same behavior - about another ns can be accumulated before
> the error creeps up again.

This sounds like a dodgy card to me. Does it happen with smaller systems?

> I plan to increase the number of atoms from 25,000 upwards to see at
> what size the error begins, to get an idea of what size of problem can
> be done on this card. However, I would like to understand what is
> going on. Thanks for any feedback.

I would try swapping out the card if you can. Or check the heatsink, etc.,
and make sure the fan is running. Take a look at the nvidia-settings tool as
well while it is running and see if the temperature is spiking. Try running
it with the case off as well and a fan pointing at the graphics card. This
will let you know whether or not it is a heat problem.

Also, note that a 650W power supply is probably a little wimpy for such a setup.

All the best
Ross


/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Aug 27 2010 - 21:30:03 PDT