Re: [AMBER] GB simulation on GPU freezes

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 17 Oct 2011 15:24:45 -0700

Hi Elif,

> The simulations do not freeze on different clusters (with 2070s), so
> apparently it is something specific to that particular cluster. Is there
> any specific programs that Amber would encounter an incompatibility, for
> instance during the recording of the output files?
>
> Or is there any way to increase the verbosity of the *out files or mdinfo
> file that would help me detect what causes my simulations to lock up??

I have run this over the weekend on my machine as well and it still hasn't
locked up. This unfortunately suggests it is a hardware issue but then you
mention a particular cluster you see this on which is disconcerting.

To clarify: have the lockups you see always been on a specific machine
(read single node), or do they occur across a specific cluster (read
multiple nodes)? The difference is important because it tells us whether
the problem is related to a specific piece of hardware, say a dodgy
motherboard or power supply, or to something more widespread, such as some
strange driver / compiler version incompatibility, or perhaps a complete
system configured with power supplies that are right on the edge of having
a high enough rating to support the node configuration.

Could you possibly provide a detailed summary of where you have seen
problems and where you haven't? That way we can try to find a suitable
workaround for you.

Also note that some causes of lockups are things we worked around in a
number of the bugfixes. So part of me still wonders whether the machine
that works without problems is definitely running the latest version of
the code, while on the machine where you see the lockups the executable
being used is actually an older one (perhaps because something in a qsub
file is overriding AMBERHOME or changing your PATH), so that you are
running into issues we have since worked around. You can check this by
looking at the beginning of your mdout file. You should see something
corresponding to this:

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
| Version 2.2
|
| 08/16/2011
|
|
| Implementation by:
| Ross C. Walker (SDSC)
| Scott Le Grand (nVIDIA)
| Duncan Poole (nVIDIA)
|
| CAUTION: The CUDA code is currently experimental.
| You use it at your own risk. Be sure to
| check ALL results carefully.
|
| Precision model in use:
| [SPDP] - Hybrid Single/Double Precision (Default).
|
|--------------------------------------------------------

The key points here are the version number, which here is 2.2, AND the
date, which is 08/16/2011.

The example I show here is from the current, up-to-date released version.
If you see something else then this is almost certainly the issue. If you
can definitely confirm it is locking up with the latest version then we can
try to dig deeper into what might be the problem.
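If you have many mdout files to go through, the check described above can
be scripted. The following is only a rough sketch in Python: the function
name parse_pmemd_cuda_header is made up for illustration, and it assumes
the banner in your mdout file looks like the one shown above (a line
containing "GPU (CUDA) Version of PMEMD", followed by a "Version x.y" line
and a date in MM/DD/YYYY form).

```python
import re


def parse_pmemd_cuda_header(mdout_text):
    """Return (version, date) from the GPU banner; None where not found.

    Hypothetical helper, not part of Amber. It only reads lines between
    the "GPU (CUDA) Version of PMEMD" line and the closing dashed line,
    so dates elsewhere in the file are ignored.
    """
    version = None
    date = None
    in_banner = False
    for line in mdout_text.splitlines():
        if "GPU (CUDA) Version of PMEMD" in line:
            in_banner = True
            continue
        if not in_banner:
            continue
        if line.startswith("|-----"):
            break  # closing line of the information box
        m = re.search(r"Version\s+(\d+(?:\.\d+)*)", line)
        if m and version is None:
            version = m.group(1)
        m = re.search(r"\b(\d{2}/\d{2}/\d{4})\b", line)
        if m and date is None:
            date = m.group(1)
    return version, date
```

Calling it on the contents of each mdout file (for example
parse_pmemd_cuda_header(open("mdout").read())) and comparing the result
against ("2.2", "08/16/2011") on both the working and the failing machine
should show quickly whether the two are running the same executable.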

Possible overheating
--------------------

If you are definitely using the latest version of the code, another
possible cause is overheating.

Something to note regarding lockups is that I have seen this with M2090
cards in my test system. However, the caveat with this is that these cards
are passively cooled. When the cooling was not sufficient they would very
quickly hang during a simulation. However, this was different to the hanging
seen with older code on the GTX480 cards for example. There the code would
hang and you could kill it and resubmit the job. With my M2090 test bed it
was a hard lockup of the card. The driver would actually drop the card from
showing up within the OS. Running device query would show only the
alternative C2070 card being present in the system and it would require a
hard reboot of the machine for the M2090 to be seen again. Fitting
additional fans to the case solved this problem. This is why it is very
important to know the exact details and circumstances surrounding the cases
where you do see lockups and where you don't.
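One way to gather this sort of detail is to log which cards the driver
reports, and their temperatures, while a job is running. Below is a rough
sketch in Python that wraps nvidia-smi. It assumes a driver whose
nvidia-smi supports the --query-gpu option (older drivers may not), and
the helper names parse_gpu_status, missing_gpus and query_gpus are made up
for illustration.

```python
import subprocess

# Assumption: this nvidia-smi supports --query-gpu with these fields.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(csv_text):
    """Parse 'index, name, temperature' lines into (int, str, int|None)."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line.count(",") < 2:
            continue  # blank or unexpected line
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp = temp.strip()
        # Some cards do not report a temperature; record None for those.
        rows.append((int(index), name.strip(),
                     int(temp) if temp.isdigit() else None))
    return rows


def missing_gpus(rows, expected_names):
    """Return the expected card names that no reported GPU name contains."""
    reported = [name for _, name, _ in rows]
    return [e for e in expected_names if not any(e in r for r in reported)]


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

Running query_gpus() every minute or so from a background script and
writing the rows to a file gives a temperature history leading up to a
hang. missing_gpus(rows, ["M2090", "C2070"]) would then flag a card that
has dropped out of the system in the way described above.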

Note, for what it's worth, I had a desktop machine that would lock up
regularly, with the OS freezing when just running regular calculations. I
finally tracked this down to the CPU overheating: on removing the heatsink
I found the thermal grease had not been applied correctly. Fixing this
resulted in a stable machine. Thus I don't rule out overheating being the
cause of the problem here.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.





_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Oct 17 2011 - 15:30:03 PDT