# Re: [AMBER] Is triplicate analysis required for pmemd.MPI and ig=-1?

From: Ross Walker <rosscwalker.gmail.com>
Date: Thu, 17 Dec 2015 14:56:01 -0800

Hi John,

Welcome to the world of sampling error! Triplicate is probably the minimum you want to do. Indeed you probably want to look for ways to start from different initial structures, or if you only have the one you could run, say, a 10ns simulation, take snapshots every 1ns and use these to seed 10 additional simulations and so on. Figuring out if your data is converged is an art form - one that I don't think anyone in the MD field has truly mastered. I'd wager that almost all published MD simulations, except for those on the simplest of systems, were never converged.

Probably a good place to start is to look at a given property that you want to measure. Suppose you want to measure a binding energy. You could run a single long simulation, or some kind of series, and calculate your predicted binding energy. Then try throwing away the first half of your data and recomputing the binding energy. Then compute it again using just the first half. Comparing these values will give you an idea of the spread in your data and some crude approximation to the error bars.
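
As a rough sketch of that half-and-half check, assuming you already have one binding energy value per frame in a plain list (the function name and the numbers below are made up for illustration, not AMBER output):

```python
from statistics import mean

def split_half_estimates(values):
    """Return (full, first_half, second_half) means of a per-frame series."""
    if len(values) < 2:
        raise ValueError("need at least two data points")
    mid = len(values) // 2
    return mean(values), mean(values[:mid]), mean(values[mid:])

# Hypothetical per-frame binding energies in kcal/mol, for illustration only.
energies = [-30.0, -32.0, -31.0, -29.0, -35.0, -36.0, -34.0, -37.0]

full, first, second = split_half_estimates(energies)
spread = abs(first - second)  # crude indication of the error bar
print(f"full={full:.2f} first half={first:.2f} "
      f"second half={second:.2f} spread={spread:.2f}")
```

If the two halves disagree by an amount comparable to the effect you are trying to measure, that is a sign the estimate is not yet converged.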

You can get more sophisticated here and plot the cumulative value of your binding energy. E.g. use the first 10 points, plot the value. Then use 20 points, plot the value, then 30 etc. This plot will show you the change in the binding energy prediction as you increase your data set and will give you an idea of how well things are converging. Now that said, you don't of course know if a sudden massive structural change is about to occur a few frames after you stopped your simulation. This might lead to a sudden jump in your average. Of course this is a standard problem in statistics. If I could predict what would happen a few steps ahead of what I have looked at so far I would be making a fortune on the stock market instead of developing MD software. ;-)
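
A minimal sketch of the cumulative calculation described above, again assuming one value per frame in a list; the function name, the step of 10 points and the toy data are illustrative assumptions:

```python
def cumulative_means(values, step=10):
    """Mean of the first n points for n = step, 2*step, ..., ending at len(values)."""
    if step < 1:
        raise ValueError("step must be positive")
    if not values:
        raise ValueError("need at least one data point")
    sizes = list(range(step, len(values) + 1, step))
    # Always include the full data set as the last point on the curve.
    if not sizes or sizes[-1] != len(values):
        sizes.append(len(values))
    return [(n, sum(values[:n]) / n) for n in sizes]

# Toy series with a deliberate jump, standing in for a late structural change.
series = [1.0] * 10 + [3.0] * 10 + [5.0] * 5

for n, avg in cumulative_means(series, step=10):
    print(f"first {n:3d} points: mean = {avg:.3f}")
```

Plotting the second column against the first gives the convergence curve. A curve that is still drifting at the end of the data, as in this toy series, suggests more sampling is needed.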

In terms of the specific example you give, you might want to look at what the actual difference is between those structures - you could try clustering each trajectory, for example, and see if one went through a structural change that the other didn't etc. The ultimate measure here is that, in principle, more sampling / more repetitions should never be worse than less sampling / fewer repetitions - if it is then you are in danger of stopping the simulation, as tempting as it is, when it gives you the result you are looking for.

So the main point is - look at ways to combine your results to determine how the property of interest you are looking at changes with changes in the amount of data present.

I hope that helps.

Others might be able to suggest some good textbooks here - I'd think of checking some basic statistics textbooks but nothing great springs immediately to mind.

All the best
Ross

> On Dec 17, 2015, at 2:25 PM, Morrow,John Kenneth <JKMorrow.mdanderson.org> wrote:
>
> Hello everyone, I have been modeling two small, interacting proteins and found that I can achieve significant differences in my production RMSD plots of the two proteins when I use pmemd.MPI with ig=-1. I am aware that both these settings will ensure that I have at least some differences, but these differences seem quite large.
>
> Should all of my production runs be done in triplicate for analysis? Is this normal for production runs with my settings or is my system not reaching equilibrium yet? I searched the literature and could not find much on this topic.
>
> Attached is a picture of the RMSD of three identical 32ns production runs with ig=-1
>
> John Morrow
> MD Anderson Cancer Center
> Graduate School of Biomedical Sciences
> Experimental Therapeutics
>
