Re: [AMBER] Problematic structure after minimization on GPU from Jan-Philip Gehrcke on 2013-08-26 (Amber Archive Aug 2013)

From: Jan-Philip Gehrcke <jgehrcke.googlemail.com>
Date: Mon, 26 Aug 2013 16:48:52 +0200

Thanks, Ross, for the quick response.

I got a notification that my initial message was blocked because of the
(very small) attachment. I sent it a second time while you were replying to
the first one, sorry for the confusion -- please ignore the new message.

My mail was too long, sorry. Important points:

- the min1+min2+heat protocol runs fine with the CPU version, i.e. CPU
minimization solves the problem. The crash during heatup, however, also
appears in the CPU version when starting from the minimized structure as
created by the GPU code.

- the error message in the GPU version can be improved. The CPU version
informs about too large velocities. The GPU version just says 'launch
failure launching kernel kNLSkinTest'.

- it would be very interesting to understand what is wrong with the
structure in min2.rst as created by the GPU version. MOE does not identify
clashes, the structure looks fine. Any pointers on what I can check to
identify the problematic part in this structure?
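
One possible check, sketched here in Python: the CPU run ends with a
"Coordinate resetting cannot be accomplished" message for a specific bond,
so ranking bonds by how far their length deviates from a reference value may
point at a strained region that a clash search misses. The coordinates, bond
list and reference lengths below are made-up illustration data, and
worst_bonds is a hypothetical helper, not an Amber tool.

```python
import math

def worst_bonds(coords, bonds, top=3):
    """Return the `top` bonds with the largest |length - reference|.

    coords: list of (x, y, z) tuples in Angstrom (hypothetical input)
    bonds:  list of (i, j, reference_length) with 0-based atom indices
    """
    scored = []
    for i, j, ref in bonds:
        length = math.dist(coords[i], coords[j])
        scored.append((abs(length - ref), i, j, length))
    # Largest deviation first.
    scored.sort(reverse=True)
    return [(i, j, length) for _, i, j, length in scored[:top]]

# Made-up example: a water-like fragment with one stretched O-H bond.
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (0.0, 1.60, 0.0)]
bonds = [(0, 1, 0.9572), (0, 2, 0.9572)]

for i, j, length in worst_bonds(coords, bonds):
    print(f"bond {i}-{j}: {length:.3f} A")
```

With real data, the coordinates would come from min2.rst and the bond list
from the topology file; reading those formats is left out of this sketch.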

Two more comments below :)

On 08/26/2013 04:18 PM, Ross Walker wrote:
> Hi Jan-Philip,
>
> I recommend running minimization on the CPU. The fixed precision used on
> the GPU has limited range for forces, about 100x that ever experienced in
> MD at 300K for a reasonable system. However, often systems at the
> beginning of minimization are very 'unreasonable' and generate huge
> initial forces. If you are not using the very latest version of the GPU
> code then these forces cause a wrapping of the integer representation and
> you get complete garbage which breaks the minimizer.

Indeed, I often experienced minimization problems with the old GPU code
versions and adjusted my scripts to use pmemd.MPI for minimization.
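
A minimal sketch of such a script, assuming stage names min1, min2 and heat
as in the protocol above: minimization stages are sent to pmemd.MPI and
everything else to pmemd.cuda. The file names, the mpirun launcher and the
process count are assumptions for illustration; only the standard pmemd
flags (-O, -i, -o, -p, -c, -r, -x) are used.

```python
def build_command(stage, prmtop, start_coords, n_procs=4):
    """Build an argv list for one stage; minimization runs on the CPU code."""
    is_minimization = stage.startswith("min")
    if is_minimization:
        # Assumed MPI launcher; adjust to the local MPI installation.
        engine = ["mpirun", "-np", str(n_procs), "pmemd.MPI"]
    else:
        engine = ["pmemd.cuda"]
    cmd = engine + [
        "-O",
        "-i", f"{stage}.in",
        "-o", f"{stage}.out",
        "-p", prmtop,
        "-c", start_coords,
        "-r", f"{stage}.rst",
    ]
    if not is_minimization:
        # Only MD stages get a trajectory file in this sketch.
        cmd += ["-x", f"{stage}.nc"]
    return cmd

# Hypothetical file names; each stage starts from the previous restart file.
coords = "system.inpcrd"
for stage in ["min1", "min2", "heat"]:
    print(" ".join(build_command(stage, "system.prmtop", coords)))
    coords = f"{stage}.rst"
```

The commands are only printed here; passing each list to subprocess.run
would execute them.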

> If you are using the
> latest version of the code it truncates the forces at the largest
> representation that SPFP supports - in most cases this works and will get
> you out of trouble but if your initial structure is too strained it will
> also likely break the minimizer.

When I read in the changelog that the truncation was implemented in GPU
minimization code, I switched back to GPU minimization. For the system
in question the mean part is that it 'works' in the sense that the
minimization does not quit with an error and the output looks fine.
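
The difference between wrapping and truncating can be illustrated with a toy
fixed-point conversion. The 32-bit width and the scale factor below are
assumptions chosen to keep the numbers readable, not the values used by
pmemd.cuda, and the helper functions are hypothetical.

```python
BITS = 32
SCALE = 2 ** 20                  # assumed fixed-point scale (illustration only)
INT_MAX = 2 ** (BITS - 1) - 1
INT_MIN = -(2 ** (BITS - 1))

def to_fixed_wrapping(force):
    """Convert to fixed point, emulating two's-complement wrap-around."""
    raw = int(round(force * SCALE))
    return (raw - INT_MIN) % (2 ** BITS) + INT_MIN

def to_fixed_clamped(force):
    """Convert to fixed point, truncating at the largest representable value."""
    raw = int(round(force * SCALE))
    return max(INT_MIN, min(INT_MAX, raw))

def to_float(fixed):
    """Convert a fixed-point integer back to a floating-point value."""
    return fixed / SCALE

for force in (12.5, 3000.0):
    wrapped = to_float(to_fixed_wrapping(force))
    clamped = to_float(to_fixed_clamped(force))
    print(f"force {force}: wrapped {wrapped}, clamped {clamped:.3f}")
```

In this toy setup the representable range is roughly -2048 to 2048, so a
force of 3000.0 wraps to -1096.0 (wrong sign and magnitude), while clamping
keeps the sign but caps the magnitude near 2048. A capped force is still not
the true force, which fits the observation that a run can finish without an
error and still yield a questionable structure.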

Thanks for your time,

Jan-Philip

>
> Essentially this is a limitation of the SPFP precision model - the
> solution is either to run the minimization with CPU or use the SPDP or
> DPDP GPU versions. We are considering changing minimization to be entirely
> SPDP in the next version of the code but to be honest minimization is such
> a minimal amount of time in a simulation project that it has pretty low
> priority over other things so I might just turn it off completely for SPFP
> and print a message saying to build the SPDP/DPDP versions and use that or
> just to use the CPU. I'll update the GPU webpage to have some info on
> minimizations.
>
> Let me know if CPU minimization fixes your system.
>
> Note the same limitations apply to SPFP in MD - that is if your system is
> still highly strained at the beginning of MD the GPU code will likely die.
> It really is designed ONLY for well behaved systems. If you still
> encounter problems with it at the initial MD stage I suggest using the CPU
> code to do the heating and then switch to the GPU code.
>
> All the best
> Ross
>
>
>
>
> On 8/26/13 4:52 AM, "Jan-Philip Gehrcke" <jgehrcke.googlemail.com> wrote:
>
>> Hello,
>>
>> I have a test case for you. It is reproducibly failing on GTX 580, GTX
>> 690, Tesla C2070 using pmemd.cuda (version 12.3.1, 08/07/2013).
>>
>> The system in question is a rather small system. After going through two
>> minimizations, it fails within the first steps of heatup with
>>
>> Error: unspecified launch failure launching kernel kNLSkinTest
>>
>> The problem seems to be in the output structure of the second
>> minimization. When starting heatup from there using the CPU version of
>> pmemd (and same input otherwise), this also fails within a few steps.
>> After the first step, pmemd says in the mdout file:
>>
>> vlimit exceeded for step 0; vmax = 28405.4406
>>
>>
>> After the third step the simulation crashes:
>>
>> vlimit exceeded for step 3; vmax = 64.6955
>>
>> Coordinate resetting cannot be accomplished,
>> deviation is too large
>> iter_cnt, my_bond_idx, i and j are : 2 948 435 434
>>
>> Running the entire protocol (min1, min2, heatup) with the CPU version, I
>> don't observe the problem at all, probably because the minimization
>> takes a different 'path'.
>>
>> The problematic system seems to hit an *extremely* special and therefore
>> unlikely coordinate constellation. Let me explain why I believe this is
>> so rare:
>>
>> In my current study I perform independent simulations of many systems
>> comprised of the same receptor protein and a relatively small ligand
>> molecule, placed distal from the receptor in the (explicit) solvent.
>> Initially, all systems have equivalent receptor coordinates. The ligand
>> molecule is the same in all systems. The internal configuration of the
>> ligand is equivalent in all systems. The placement of the ligand's
>> center of mass is equivalent in all systems. The systems only differ in
>> the rotational state of the ligand around its COM. All of these systems
>> evolve fine during minimization, heatup, equilibration and production.
>> Except for the one that reproducibly fails during heatup. I can make it
>> not fail during heatup by reducing maxcyc from 1000 to 700 in the
>> first minimization -- so this really seems to be an unfortunate and
>> unlikely combination of conditions. And if it wasn't for the awesome
>> simulation reproducibility of recent Amber GPU code, I probably would
>> not have observed this more than once.
>>
>> Regarding the problematic system, the starting structure for heatup (the
>> last restart file of the second minimization), visualized in VMD, looks
>> fine: the ligand is still faaar away from the protein, beautiful water
>> molecules as placed by leap (and already slightly wiggled) are present.
>> I could not find any clashes in that structure (automated search), so to
>> me there is no obvious problem with that file.
>>
>> Visualizing the heatup trajectory recorded with ntwr=1 just shows that
>> the system suddenly explodes in frame 20 or so.
>>
>> I think it is also worth pointing out
>>
>> - that I have used the same heatup input settings for a long time now,
>> applied to various different systems. Maybe it's not optimal, but it has
>> worked so far.
>>
>> - that the heatup fails on GPU and CPU with 'ig = -1', so this does not
>> depend on any specific random number sequence.
>>
>> - that the problem in min2.rst does not depend on ASCII or NetCDF
>> storage (I tried both).
>>
>>
>> I see that I myself can simply work around this problem. However, I
>> found it important to share with you, because
>>
>> - the error message in the GPU version can be improved. The CPU version
>> informs about crazy velocities. The GPU version just says 'launch
>> failure launching kernel kNLSkinTest'.
>>
>> - it would be very interesting to understand what is wrong with the
>> structure in min2.rst as created by the GPU version, maybe someone can
>> clarify.
>>
>> - there might be a problem in the GPU minimization code that 'creates'
>> the problematic structure.
>>
>> I have created an archive for you:
>>
>> http://gehrcke.de/files/perm/amber130826/heatup-fail-repro.tar.gz (700 kB)
>>
>> It contains the initial coordinate file and the parameter topology file
>> as created by leap, as well as a shell script repro.sh that contains all
>> you need to trigger the problem (just run it, it creates all the
>> relevant amber input). I also attach the content of the script to this
>> mail.
>>
>>
>> Cheers,
>>
>> Jan-Philip
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber

Received on Mon Aug 26 2013 - 08:00:02 PDT