Re: [AMBER] PCA_representation_in_Cpptraj

From: Jason Swails <jason.swails.gmail.com>
Date: Tue, 10 Mar 2015 10:15:20 -0400

On Tue, 2015-03-10 at 11:37 +0000, Juan Eiros Zamora wrote:
> Hi all,
>
> I would like to hop on in this discussion. I have been using this
> procedure to analyze my classical MD trajectories using the following
> cpptraj script
>
> ~~~~
> parm *.prmtop
> trajin *.nc
>
> rms first
> average avg.nc
> run
>
> reference avg.nc [ref1]
> matrix covar name matrixdat @CA out covmat-ca.dat
> diagmatrix matrixdat out evecs-ca.dat vecs 10
> run
>
> projection modes evecs-ca.dat out pca-ca.dat beg 1 end 2 @CA
> run
>
> readdata pca-ca.dat
> runanalysis hist pca-ca.dat:2 pca-ca.dat:3 free 300 out fhist-all.CA.gnu bins 400
> ~~~~
>
> I am a bit confused with the settings that have to be used for PCA
> analysis. To my understanding, the first few PC are responsible for the
> majority of the variance of the system. Therefore, is there a reason why
> you calculate 174 PCs?

This is true, but it's often interesting to quantify *how much* of the
variance is accounted for by the first few PCs. This is easy enough to
do if you have all of the eigenvalues (or all of the non-negligible
eigenvalues). You can get a percentage of the variance accounted for by
the first N eigenvectors by dividing the sum of the first N eigenvalues
by the sum of all of them.
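
As a quick sketch of that bookkeeping (in Python rather than cpptraj,
with placeholder eigenvalues standing in for whatever diagmatrix wrote
to evecs-ca.dat):

~~~~
import numpy as np

# Placeholder eigenvalues, largest first; substitute the values cpptraj
# writes to the diagmatrix output file (evecs-ca.dat here).
eigenvalues = np.array([12.4, 5.1, 2.0, 0.9, 0.5, 0.3, 0.2, 0.1])

N = 2  # number of leading PCs you care about
fraction = eigenvalues[:N].sum() / eigenvalues.sum()
print(f"First {N} PCs account for {100 * fraction:.1f}% of the variance")
~~~~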

> I thought that maybe only calculating 10 would be
> enough? Also, why do you set beg 1 end 3 in your projection function? If
> one wants to plot the 1st PC vs. the 2nd PC only beg 1 end 2 is needed I
> presume.

Yes, I think you would only need beg 1 end 2 if you had no interest in
looking at the third eigenvector. But it doesn't cost that much to
compute it and print it.

> Additionally, I am not quite sure how the last line of code works. I do
> not understand how changing the default bin size affects the
> calculation, should I just leave it as default (erase it from my script)
> if I want to analyze classical MD trajectory?

This is standard histogramming. The bin size is intimately related to
the populations of those bins. For instance, suppose your bin size is
WAY too small. In this case, the spacing between the bins is so small
that the chances of two points residing in the same bin are very low.
So all you have is a set of bins, some with an occupancy of 1 and the
rest with an occupancy of 0, which gives you no more information than a
plain list of the points.
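
To make that extreme concrete, here is a small Python illustration (the
random Gaussian sample is just a stand-in for one PC projection column):

~~~~
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=500)        # stand-in for one PC projection column

# Absurdly narrow bins: nearly every occupied bin holds a single point.
counts, _ = np.histogram(data, bins=100000)
print((counts == 1).sum(), "bins hold exactly one point;",
      (counts > 1).sum(), "bins hold more than one")
~~~~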

On the other hand, assume your whole grid was one large bin. Then
every point fits in that same bin, which again gives you no useful
information. There is a "sweet spot" where the bin size is large enough
to avoid making the density too noisy, yet small enough to still
resolve the meaningful features of the distribution. There are "rules"
for picking a bin width automatically (see
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width for a
list -- I'm most familiar with Scott's rule), but these are empirical
suggestions that tend to work well for certain types of data
distributions.
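
Scott's rule, for example, sets the bin width from the sample standard
deviation and the number of points; a rough Python sketch (again on
synthetic stand-in data):

~~~~
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=5000)       # stand-in for a PC projection column

# Scott's rule: bin width h = 3.49 * sigma * n**(-1/3)
h = 3.49 * data.std(ddof=1) * len(data) ** (-1.0 / 3.0)
nbins = int(np.ceil((data.max() - data.min()) / h))
counts, edges = np.histogram(data, bins=nbins)
print(f"Scott's rule bin width: {h:.3f} -> {nbins} bins")

# numpy can also pick this automatically: np.histogram(data, bins='scott')
~~~~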

This bin-width dependence is one reason why many people prefer a
bin-free alternative, like the kernel density estimator (KDE). In this
case, each point is "blurred" according to its kernel (often a Gaussian
function), so that the density you get is a continuous sum of
overlapping kernels. The analogue of the bin width here is the
bandwidth (how "wide" the kernel is -- for a Gaussian kernel, its
standard deviation), and it behaves in many respects like the histogram
bin width. In my experience, though, KDEs behave better than
histograms for small-to-intermediate amounts of data.
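
A bare-bones example with SciPy's gaussian_kde, which defaults to a
Scott's-rule bandwidth (the data are again synthetic):

~~~~
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=2000)       # stand-in for a PC projection column

kde = gaussian_kde(data)           # bandwidth chosen by Scott's rule
grid = np.linspace(data.min(), data.max(), 200)
density = kde(grid)                # smooth, continuous density estimate

# Shrinking the bandwidth acts much like shrinking the histogram bin width.
kde_narrow = gaussian_kde(data, bw_method=kde.factor / 4)
~~~~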

HTH,
Jason

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Mar 10 2015 - 07:30:03 PDT