Re: [AMBER] GB simulation on GPU freezes

From: E. Nihal Korkmaz <enihalkorkmaz.gmail.com>
Date: Mon, 17 Oct 2011 21:08:13 -0500

My out files do not say version 2.2! I applied the bugfixes as directed on the
Amber site; apparently some steps failed to get patched correctly.

Now I am trying to apply the patches correctly, but when I do
cd $AMBERHOME
 ./apply_bugfix.x bugfix.1to17.tar.bz2

It launches the 1st bugfix and then prompts for a file to patch, like this:

|Use this patch in $AMBERHOME/src/sander/qm_mm.f
|------------------------------------------------------------------------------
|--- src/sander/qm_mm.f 2010-05-11 20:24:37.000000000 -0700
|+++ src/sander/qm_mm.f 2010-05-11 20:24:54.000000000 -0700
--------------------------
File to patch:

What would be the right file to patch?
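
(For what it's worth, going by the "Use this patch in $AMBERHOME/src/sander/qm_mm.f"
header, my guess is that, when run from $AMBERHOME, the prompt wants that same
relative path, i.e. something like

  File to patch: src/sander/qm_mm.f

but I am not sure whether that is how apply_bugfix.x is supposed to work.)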



Thank you very much for the support.
Best,
Nihal



On Mon, Oct 17, 2011 at 5:24 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Elif,
>
> > The simulations do not freeze on other clusters (with 2070s), so
> > apparently it is something specific to that particular cluster. Are
> > there any specific programs with which Amber might hit an
> > incompatibility, for instance while writing the output files?
> >
> > Or is there any way to increase the verbosity of the *out files or the
> > mdinfo file that would help me detect what causes my simulations to lock
> > up?
>
> I have run this over the weekend on my machine as well and it still hasn't
> locked up. That unfortunately suggests a hardware issue, but then you mention
> a particular cluster on which you see this, which is disconcerting.
>
> To clarify: the lockups that you see, have they always been on a specific
> machine (read: a single node), or do they occur across a specific cluster
> (read: multiple nodes)? The difference is important because it determines
> whether the problem is related to a specific piece of hardware, say a dodgy
> motherboard or power supply, or to something more widespread, such as a
> strange driver / compiler version incompatibility, or perhaps an entire
> system configured with power supplies that are right on the edge of having a
> high enough rating to support the node configuration.
>
> Could you possibly provide a detailed summary of where you have seen
> problems and where you haven't? That way we can try to find a suitable
> workaround for you.
>
> Also note that one cause of lockups was something we worked around with a
> number of the bugfixes. So part of me still wonders whether, on the machine
> that currently runs without problems, you are indeed using the latest version
> of the code, while on the machine where you see the lockups the executable
> actually being used (perhaps because something in a qsub file overrides
> AMBERHOME or changes your path, etc.) is an older one, and you are running
> into issues we have since worked around. You can check this by looking at the
> beginning of your mdout file. You should see something corresponding to this:
>
> |--------------------- INFORMATION ----------------------
> | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
> | Version 2.2
> |
> | 08/16/2011
> |
> |
> | Implementation by:
> | Ross C. Walker (SDSC)
> | Scott Le Grand (nVIDIA)
> | Duncan Poole (nVIDIA)
> |
> | CAUTION: The CUDA code is currently experimental.
> | You use it at your own risk. Be sure to
> | check ALL results carefully.
> |
> | Precision model in use:
> | [SPDP] - Hybrid Single/Double Precision (Default).
> |
> |--------------------------------------------------------
>
> The key points here are the version number, which here is 2.2, AND the date,
> which is 08/16/2011.
>
> The example I show here is the current, up-to-date released version. If you
> see something else, then that is almost certainly the issue. If you can
> definitely confirm it is locking up with the latest version, then we can try
> to dig deeper into what might be the problem.
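>
> As a quick sanity check (adjust the file names to suit your runs), running
> something like
>
>   grep -A 3 "Version of PMEMD" *.out
>
> on both the cluster that works and the one that locks up will show which
> code version and date each job actually used, and adding a line such as
>
>   echo $AMBERHOME ; which pmemd.cuda
>
> near the top of your qsub script will show whether the job is really picking
> up the executable you expect.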
>
> Possible overheating
> --------------------
>
> Assuming you are definitely using the latest version of the code, another
> possible cause is overheating.
>
> Something to note regarding lockups is that I have seen this with M2090
> cards in my test system. The caveat is that these cards are passively
> cooled; when the cooling was not sufficient they would very quickly hang
> during a simulation. However, this was different from the hanging seen with
> older code on the GTX480 cards, for example. There the code would hang and
> you could kill it and resubmit the job. With my M2090 test bed it was a hard
> lockup of the card: the driver would actually drop the card from showing up
> within the OS. Running deviceQuery would show only the other C2070 card
> present in the system, and a hard reboot of the machine was required for the
> M2090 to be seen again. Fitting additional fans to the case solved this
> problem. This is why it is very important to know the exact details and
> circumstances surrounding the cases where you do and do not see lockups.
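>
> If you can get onto one of the problem nodes while a job is running, it is
> also worth watching the GPU temperatures. Depending on your driver version,
> something along the lines of
>
>   nvidia-smi -q -d TEMPERATURE
>
> (or simply re-running nvidia-smi every few seconds) should report them,
> which would help rule overheating in or out on that cluster.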
>
> Note, for what it's worth, I had a desktop machine that would lock up
> regularly, with the OS freezing even when just running regular calculations.
> I finally tracked this down to the CPU overheating: on removing the heatsink
> I found the thermal grease had not been applied correctly. Fixing this
> resulted in a stable machine. So I don't rule out overheating as the cause
> of the problem here.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> ---------------------------------------------------------
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Adjunct Assistant Professor |
> | Dept. of Chemistry and Biochemistry |
> | University of California San Diego |
> | NVIDIA Fellow |
> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
> ---------------------------------------------------------
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
>
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>



-- 
Elif Nihal Korkmaz
Research Assistant
University of Wisconsin - Biophysics
Member of Qiang Cui & Thomas Record Labs
1101 University Ave, Rm. 8359
Madison, WI 53706
Phone:  608-265-3644
Email:   korkmaz.wisc.edu
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Oct 17 2011 - 19:30:02 PDT