Re: [AMBER] GB simulation on GPU freezes

From: E. Nihal Korkmaz <enihalkorkmaz.gmail.com>
Date: Tue, 18 Oct 2011 03:08:12 -0500

I deleted everything one more time, compiled from scratch, and now everything
is working with the right pmemd.cuda version.

Thank you so much for helping me out on this issue.
Best regards,
Nihal

On Mon, Oct 17, 2011 at 9:08 PM, E. Nihal Korkmaz
<enihalkorkmaz.gmail.com> wrote:

> My out files do not say version 2.2! I applied the bugfixes as directed on
> the Amber site, but apparently some steps failed to get patched correctly.
>
> Now I am trying to apply the patches correctly, but when I do
> cd $AMBERHOME
> ./apply_bugfix.x bugfix.1to17.tar.bz2
>
> it launches the 1st bugfix and prompts for a file name to patch, like this:
>
> |Use this patch in $AMBERHOME/src/sander/qm_mm.f
>
> |------------------------------------------------------------------------------
> |--- src/sander/qm_mm.f 2010-05-11 20:24:37.000000000 -0700
> |+++ src/sander/qm_mm.f 2010-05-11 20:24:54.000000000 -0700
> --------------------------
> File to patch:
>
> What would be the right file to patch?
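>
> My guess from the "---" header above is that it expects the path relative to
> $AMBERHOME, i.e. answering the prompt with
>
>     File to patch: src/sander/qm_mm.f
>
> but I am not certain that is the intended answer.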
>
>
>
> Thank you very much for the support.
> Best,
> Nihal
>
>
>
>
> On Mon, Oct 17, 2011 at 5:24 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Elif,
>>
>> > The simulations do not freeze on different clusters (with 2070s), so
>> > apparently it is something specific to that particular cluster. Are there
>> > any specific programs with which Amber would encounter an incompatibility,
>> > for instance while writing the output files?
>> >
>> > Or is there any way to increase the verbosity of the *out files or the
>> > mdinfo file that would help me detect what causes my simulations to lock
>> > up?
>>
>> I have run this over the weekend on my machine as well and it still hasn't
>> locked up. That unfortunately suggests a hardware issue, but the fact that
>> you see it on one particular cluster is disconcerting.
>>
>> To clarify: the lockups that you see, have they always been on a specific
>> machine (read: a single node) or do they occur across a specific cluster
>> (read: multiple nodes)? The difference is important because it determines
>> whether the problem is related to a specific piece of hardware, say a dodgy
>> motherboard or power supply, or to something more widespread such as a
>> strange driver / compiler version incompatibility, or perhaps a whole system
>> configured with power supplies that are right on the edge of having a high
>> enough rating to support the node configuration.
>>
>> Could you possibly provide a detailed summary of where you have seen
>> problems and where you haven't? That way we can try to find a suitable
>> workaround for you.
>>
>> Also note that one of the lockup issues was something we worked around with
>> a number of the bugfixes. So part of me still wonders whether, on the
>> machine that currently runs without problems, you are definitely using the
>> latest version of the code, while on the machine where you see the lockups
>> the executable actually being used (perhaps because something in a qsub
>> file overrides AMBERHOME or changes your path, etc.) is an older one, and
>> you are running into issues we have since worked around. You can check this
>> by looking at the beginning of your mdout file. You should see something
>> corresponding to this:
>>
>> |--------------------- INFORMATION ----------------------
>> | GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
>> | Version 2.2
>> |
>> | 08/16/2011
>> |
>> |
>> | Implementation by:
>> | Ross C. Walker (SDSC)
>> | Scott Le Grand (nVIDIA)
>> | Duncan Poole (nVIDIA)
>> |
>> | CAUTION: The CUDA code is currently experimental.
>> | You use it at your own risk. Be sure to
>> | check ALL results carefully.
>> |
>> | Precision model in use:
>> | [SPDP] - Hybrid Single/Double Precision (Default).
>> |
>> |--------------------------------------------------------
>>
>> The key points here are the version number, which here is 2.2, AND the
>> date, which is 08/16/2011.
>>
>> The example I show here is the current, up-to-date released version. If you
>> see something else then this is almost certainly the issue. If you can
>> definitely confirm it is locking up with the latest version then we can try
>> to dig deeper into what might be the problem.
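>>
>> A quick way to check, assuming your output file is simply called mdout
>> (adjust the name to whatever your run actually writes), is something like:
>>
>>   grep -A 3 "Version of PMEMD in use" mdout
>>
>> which will print the version and date lines. In the qsub script itself,
>> "which pmemd.cuda" and "echo $AMBERHOME" will show which executable and
>> which installation the job environment is really picking up.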
>>
>> Possible overheating
>> --------------------
>>
>> Assuming you are definitely using the latest version of the code, another
>> possible problem is overheating.
>>
>> Something to note regarding lockups is that I have seen this with M2090
>> cards in my test system. The caveat here is that these cards are passively
>> cooled. When the cooling was not sufficient they would very quickly hang
>> during a simulation. However, this was different from the hanging seen with
>> older code on the GTX480 cards, for example. There the code would hang and
>> you could kill it and resubmit the job. With my M2090 test bed it was a
>> hard lockup of the card: the driver would actually drop the card from
>> showing up within the OS. Running deviceQuery would show only the other
>> C2070 card present in the system, and it would require a hard reboot of the
>> machine for the M2090 to be seen again. Fitting additional fans to the case
>> solved this problem. This is why it is very important to know the exact
>> details and circumstances surrounding the cases where you do see lockups
>> and where you don't.
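>>
>> If you can watch the nodes while a job runs, monitoring the GPU temperature
>> can help separate overheating from a driver or code problem. Something
>> along these lines (nvidia-smi ships with the driver; -l repeats the query
>> at the given interval in seconds):
>>
>>   nvidia-smi -q -d TEMPERATURE -l 5
>>
>> A card whose temperature climbs steadily right up to the lockup points at
>> cooling rather than the code.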
>>
>> Note, for what it's worth, I had a desktop machine that would lock up
>> regularly: the OS would freeze when just running regular calculations. I
>> finally tracked this down to the CPU overheating; when I removed the
>> heatsink I found the thermal grease had not been applied correctly. Fixing
>> this resulted in a stable machine. Thus I don't rule out overheating as the
>> cause of the problem here.
>>
>> All the best
>> Ross
>>
>> /\
>> \/
>> |\oss Walker
>>
>> ---------------------------------------------------------
>> | Assistant Research Professor |
>> | San Diego Supercomputer Center |
>> | Adjunct Assistant Professor |
>> | Dept. of Chemistry and Biochemistry |
>> | University of California San Diego |
>> | NVIDIA Fellow |
>> | http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |
>> | Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
>> ---------------------------------------------------------
>>
>> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
>> be read every day, and should not be used for urgent or sensitive issues.
>>
>>
>>
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
>
> --
> Elif Nihal Korkmaz
>
> Research Assistant
> University of Wisconsin - Biophysics
> Member of Qiang Cui & Thomas Record Labs
> 1101 University Ave, Rm. 8359
> Madison, WI 53706
> Phone: 608-265-3644
> Email: korkmaz.wisc.edu
>
>
>


-- 
Elif Nihal Korkmaz
Research Assistant
University of Wisconsin - Biophysics
Member of Qiang Cui & Thomas Record Labs
1101 University Ave, Rm. 8359
Madison, WI 53706
Phone:  608-265-3644
Email:   korkmaz.wisc.edu
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Oct 18 2011 - 01:30:02 PDT