Re: [AMBER] AMBER 14 DPFP single energy calculations inconsistent

From: Scott Le Grand <varelse2005.gmail.com>
Date: Thu, 29 Jan 2015 08:51:22 -0800

I think I found the problem. There's something I can fix that will make
the manifestation disappear, but the underlying bad news is that I think
your Born radius parameters are invalid, and that's triggering a read of
an uninitialized variable: the equivalent of a single-threaded race
condition, because God only knows the order of computation and what ends
up in that uninitialized variable.

But the thing is, that variable should *never* see a situation where it
doesn't get a value. If it does, that's a sign something is wrong with
your Born radii, so I am somewhat tempted to leave the code as is,
because there's no way I'm putting a printf warning in an innermost loop
on the GPU.

David, Ross et al., is there a valid situation in the Born radius
calculation where dr should not get a value? Or should it *always* fall
into one of the clauses, as I previously believed?
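
To make the question concrete, here is the shape of what I'm worried about
as a tiny self-contained sketch (the clause boundaries and arithmetic are
made up for illustration, not the actual kernel code):

// dr_sketch.cpp -- hypothetical sketch of the failure mode, not the real
// kCalculateGBBornRadii code.  'dr' is only assigned inside the clauses;
// it is seeded with NaN here so a fall-through is visible instead of
// silently picking up whatever happened to be left in the register.
#include <cmath>
#include <cstdio>

double born_term(double dist, double ri, double rj) {
    double dr = std::nan("");        // stands in for "never assigned"
    if (dist + ri <= rj) {           // made-up clause boundaries
        dr = 1.0 / rj;
    } else if (dist < ri + rj) {
        dr = 1.0 / (dist + ri);
    }
    // No final else: with sane Born radii one of the clauses should always
    // fire, but bad radii can fall through and leave dr undefined.
    return dr;
}

int main() {
    printf("sane radii: %g\n", born_term(1.0, 1.5, 2.0));  // hits a clause
    printf("bad radii:  %g\n", born_term(9.0, 1.5, 2.0));  // falls through
    return 0;
}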

Scott




On Wed, Jan 28, 2015 at 7:52 PM, Scott Le Grand <varelse2005.gmail.com>
wrote:

> It's a race condition all right, and there's no reason for it; I'm raising
> the odds of a compiler bug to 20%...
>
> kCalculateGBBornRadii.h line 211 is where this happens. No obvious cause
> whatsoever, but there was some bizarro compiler bug a few years ago in this
> kernel, so it's not entirely impossible...
>
>
> On Wed, Jan 28, 2015 at 12:22 PM, Scott Le Grand <varelse2005.gmail.com>
> wrote:
>
>> So I found where this is happening and...
>>
>> For the moment, there's no bug. This is a crazy race condition in
>> kReduceGBBornRadii. If you're CUDA-ambitious, you can look at this kernel
>> and see that there's no way to cause a race condition here unless the
>> output pointers are messed up. And so far they look fine. Stay tuned: 5%
>> chance of a compiler bug here (right now, 95% chance I'm being dumb)...
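>>
>> To be concrete about what 'messed up output pointers' would mean, here is
>> a toy host-side sketch (std::thread rather than the real CUDA kernel;
>> none of this is pmemd code, just the shape of the failure mode):
>>
>> // alias_race_sketch.cpp -- hypothetical illustration: a per-chunk
>> // reduction is race-free as long as every worker owns its own output
>> // slot; hand two workers the same output address and you get an
>> // unsynchronized read-modify-write, i.e. a timing-dependent result.
>> #include <cstdio>
>> #include <thread>
>> #include <vector>
>>
>> void reduce_chunk(const double* data, int n, double* out) {
>>     double sum = 0.0;
>>     for (int i = 0; i < n; ++i) sum += data[i];
>>     *out += sum;             // only safe if 'out' is unique to this worker
>> }
>>
>> int main() {
>>     std::vector<double> data(2000, 1.0);
>>     double good[2] = {0.0, 0.0}, bad[1] = {0.0};
>>
>>     // Correct wiring: distinct output slots, deterministic answer (2000).
>>     std::thread a(reduce_chunk, data.data(),        1000, &good[0]);
>>     std::thread b(reduce_chunk, data.data() + 1000, 1000, &good[1]);
>>     a.join(); b.join();
>>     printf("distinct outputs: %g\n", good[0] + good[1]);
>>
>>     // Messed-up wiring: both workers share one slot, which is a data
>>     // race; the printed value depends on how the updates interleave.
>>     std::thread c(reduce_chunk, data.data(),        1000, &bad[0]);
>>     std::thread d(reduce_chunk, data.data() + 1000, 1000, &bad[0]);
>>     c.join(); d.join();
>>     printf("aliased outputs:  %g (not guaranteed to be 2000)\n", bad[0]);
>>     return 0;
>> }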
>>
>>
>>
>> On Wed, Jan 28, 2015 at 10:27 AM, R.G. Mantell <rgm38.cam.ac.uk> wrote:
>>
>>> Hi Ross,
>>>
>>> The EGB energies in my previous email are from my original input but
>>> using your suggestion of 'imin=0, nstlim=1, ntpr=1' in min.in.
>>>
>>> I have some structures which are slightly lower in RMS force here:
>>> http://www-wales.ch.cam.ac.uk/rosie/lowrms/
>>> They were generated using our CUDA L-BFGS minimiser interfaced with
>>> AMBER 12 DPDP and I see the same problems when the structures are put
>>> into pmemd.cuda. Unfortunately I can't get any structures with a very low
>>> RMS force, because the inconsistent energies confuse the line search in
>>> the minimiser, but I've managed to get it a bit lower than for the
>>> original structure. Still waiting on the CPU minimisation...
>>>
>>> I don't know whether it happens with PME simulations as we only have GB
>>> interfaced with our code.
>>>
>>> Thanks,
>>>
>>> Rosie
>>>
>>> On 2015-01-27 18:54, Ross Walker wrote:
>>> > Hi Rosemary,
>>> >
>>> > Okay, this definitely looks like a bug - although a weird one. Can you
>>> > send me the input files you used for the EGB energies you list below
>>> > (the structure with a low RMS force?) and I'll test this.
>>> >
>>> > One quick question - does this only happen with GB simulations - or
>>> > have you seen such behavior with PME simulations as well?
>>> >
>>> > All the best
>>> > Ross
>>> >
>>> >
>>> >> On Jan 27, 2015, at 10:13 AM, Rosemary Mantell <rgm38.cam.ac.uk>
>>> >> wrote:
>>> >>
>>> >> Hi Ross,
>>> >>
>>> >> I should probably mention that I first saw this problem when using the
>>> >> AMBER 12 DPDP model, so it's not just a problem with fixed precision.
>>> >>
>>> >> I've set a longer CPU minimisation running today as you suggested,
>>> >> though it will be a little while until it finishes. I will let you know
>>> >> what I find when it's done. However, I also have an L-BFGS minimiser
>>> >> written in CUDA that I have interfaced with the AMBER 12 DPDP potential,
>>> >> and I have been using this to run minimisations with this system.
>>> >> Although the minimisations don't converge properly (the line search in
>>> >> the minimiser is not tolerant of the fluctuating energies being
>>> >> produced), I was able to generate some structures with a much lower RMS
>>> >> force and put these back into pmemd.cuda. I am still seeing the same
>>> >> problem with DPFP and not with SPFP for a variety of different
>>> >> structures.
>>> >>
>>> >> I also tried 'imin=0, nstlim=1, ntpr=1' and the EGB energies I got for
>>> >> 10 tests with DPFP are: -119767.0113, -119763.2412, -119764.3177,
>>> >> -119764.4183, -119763.3771, -119765.8321, -119765.3539, -119764.3328,
>>> >> -119764.1440, -119764.9855.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Rosie
>>> >>
>>> >> On 26/01/2015 15:48, Ross Walker wrote:
>>> >>> Hi Rosie,
>>> >>>
>>> >>> This does indeed look concerning, although it is not surprising if your
>>> >>> structure is highly strained. The fixed precision model is such that
>>> >>> if energies or forces are too large they will overflow the fixed
>>> >>> precision accumulators. This should never happen during MD, since the
>>> >>> forces would be so large as to cause the system to explode, but it can
>>> >>> happen in minimization; given that minimization is designed just to
>>> >>> clean up highly strained structures, that should not be a concern. The
>>> >>> first thing we should do, though, is establish whether that is the
>>> >>> case here or whether this is a more deeply rooted bug.
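>>> >>>
>>> >>> As a rough, self-contained sketch of what 'overflow the fixed precision
>>> >>> accumulators' means (the 2^40 scale factor below is an assumed value
>>> >>> for illustration, not necessarily the constant pmemd.cuda uses):
>>> >>>
>>> >>> // overflow_sketch.cpp -- hypothetical illustration of a scaled 64-bit
>>> >>> // integer accumulator running out of headroom when a single
>>> >>> // contribution is enormous (e.g. from a badly clashing structure).
>>> >>> #include <cmath>
>>> >>> #include <cstdint>
>>> >>> #include <cstdio>
>>> >>> #include <limits>
>>> >>>
>>> >>> const double SCALE = std::pow(2.0, 40);  // assumed scale for the sketch
>>> >>>
>>> >>> // A contribution only fits if |x| * SCALE stays inside the signed
>>> >>> // 64-bit range, i.e. |x| < 2^63 / 2^40 = 2^23 (about 8.4e6).
>>> >>> bool fits(double x) {
>>> >>>     return std::fabs(x) * SCALE <
>>> >>>            (double)std::numeric_limits<int64_t>::max();
>>> >>> }
>>> >>>
>>> >>> int main() {
>>> >>>     printf("typical term  (1.2e3):  fits = %s\n", fits(1.2e3)  ? "yes" : "no");
>>> >>>     printf("strained term (1.0e13): fits = %s\n", fits(1.0e13) ? "yes" : "no");
>>> >>>     return 0;
>>> >>> }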
>>> >>>
>>> >>> Can you first run a few thousand steps of minimization of your
>>> >>> structure on the CPU and then, from the restart files you get from
>>> >>> that, repeat your tests (just pick a single GPU model and CUDA version,
>>> >>> as that should not be relevant unless the GPU is faulty, which is
>>> >>> unlikely given what you describe). Try it 10 times or so with SPFP and
>>> >>> DPFP and see what you get. This will give us an idea of where to start
>>> >>> looking.
>>> >>>
>>> >>> Could you also try, instead of imin=1, setting:
>>> >>>
>>> >>> imin=0, nstlim=1, ntpr=1
>>> >>>
>>> >>> and see what energies get reported there. This does the same
>>> >>> calculation, but through the MD routines rather than the minimization
>>> >>> routines.
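>>> >>>
>>> >>> Something along these lines for the mdin should do it (a minimal
>>> >>> sketch; keep igb, cut and the rest the same as in your current GB
>>> >>> input, the values below are only placeholders):
>>> >>>
>>> >>> single point energy through the MD routines
>>> >>>  &cntrl
>>> >>>   imin=0, nstlim=1, ntpr=1,
>>> >>>   igb=1, ntb=0, cut=9999.0,
>>> >>>  /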
>>> >>>
>>> >>> When I get a chance later today I'll also try it on my own machine
>>> >>> with the input you provided.
>>> >>>
>>> >>> All the best
>>> >>> Ross
>>> >>>
>>> >>>> On Jan 26, 2015, at 7:34 AM, R.G. Mantell <rgm38.cam.ac.uk> wrote:
>>> >>>>
>>> >>>> I'm not doing a full minimisation. I am using imin = 1, maxcyc = 0,
>>> >>>> ncyc = 0, so I would hope to get the same energy if I ran this same
>>> >>>> calculation using DPFP several times. Running it five times I get:
>>> >>>> EGB = -119080.5069, EGB = -119072.8449, EGB = -119079.8208,
>>> >>>> EGB = -119076.1230, EGB = -119073.7929.
>>> >>>> If I do this same test with another system, I get the same EGB energy
>>> >>>> every time.
>>> >>>>
>>> >>>> Thanks,
>>> >>>>
>>> >>>> Rosie
>>> >>>>
>>> >>>> On 2015-01-26 15:09, David A Case wrote:
>>> >>>>> On Mon, Jan 26, 2015, R.G. Mantell wrote:
>>> >>>>>> I am having some problems with pmemd.cuda_DPFP in AMBER 14 and also
>>> >>>>>> seeing the same problems in AMBER 12 with DPDP and SPDP precision
>>> >>>>>> models. I have some input for which a single energy calculation does
>>> >>>>>> not yield the same energy each time I run it. Looking at min.out, it
>>> >>>>>> seems that it is the EGB component which gives a different value each
>>> >>>>>> time. This does not occur when using SPFP or the CPU version of
>>> >>>>>> AMBER. I do not see this problem when using input for other systems.
>>> >>>>>> I have tried the calculation on a Tesla K20m GPU and a GeForce GTX
>>> >>>>>> TITAN Black GPU using several different versions of the CUDA toolkit.
>>> >>>>>> I see the same problem with both igb=1 and igb=2. The input which
>>> >>>>>> causes the problem can be found here:
>>> >>>>>> http://www-wales.ch.cam.ac.uk/rosie/nucleosome_input/
>>> >>>>> Can you say how different the values are on each run? What you
>>> >>>>> describe is exactly what should be expected: parallel runs (and all
>>> >>>>> GPU runs are highly parallel) with DPDP or SPDP are not deterministic,
>>> >>>>> whereas Amber's SPFP is.
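>>> >>>>>
>>> >>>>> As a tiny illustration of why: double precision addition is not
>>> >>>>> associative, so a parallel sum whose order varies between runs can
>>> >>>>> change, while a fixed-point (integer) sum is order-independent. The
>>> >>>>> 2^40 scale below is an arbitrary value for the sketch:
>>> >>>>>
>>> >>>>> // order_sketch.cpp -- summation order changes a double-precision
>>> >>>>> // result but not a scaled-integer (fixed-point) one.
>>> >>>>> #include <cstdio>
>>> >>>>>
>>> >>>>> int main() {
>>> >>>>>     double x = 0.1, y = 0.2, z = 0.3;
>>> >>>>>     printf("(x+y)+z = %.17f\n", (x + y) + z);  // 0.60000000000000009
>>> >>>>>     printf("(z+y)+x = %.17f\n", (z + y) + x);  // 0.59999999999999998
>>> >>>>>
>>> >>>>>     // Fixed point: scale to 64-bit integers first, then accumulate;
>>> >>>>>     // every summation order gives bit-identical results.
>>> >>>>>     const long long SCALE = 1LL << 40;
>>> >>>>>     long long xi = (long long)(x * SCALE), yi = (long long)(y * SCALE),
>>> >>>>>               zi = (long long)(z * SCALE);
>>> >>>>>     printf("fixed point: %lld vs %lld\n", (xi + yi) + zi, (zi + yi) + xi);
>>> >>>>>     return 0;
>>> >>>>> }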
>>> >>>>>
>>> >>>>> On the other hand, if you are seeing significant differences between
>>> >>>>> runs for DPDP, that might indicate a bug that needs to be examined.
>>> >>>>>
>>> >>>>> ...thx...dac
>>> >>>>>
>>> >>>>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Jan 29 2015 - 09:00:04 PST