
From: Jason Swails <jason.swails.gmail.com>

Date: Tue, 10 Mar 2015 10:15:20 -0400

On Tue, 2015-03-10 at 11:37 +0000, Juan Eiros Zamora wrote:

> Hi all,
>
> I would like to hop on in this discussion. I have been using this
> procedure to analyze my classical MD trajectories using the following
> cpptraj script
>
> ~~~~
> parm *.prmtop
> trajin *.nc
>
> rms first
> average avg.nc
> run
>
> reference avg.nc [ref1]
> matrix covar name matrixdat .CA out covmat-ca.dat
> diagmatrix matrixdat out evecs-ca.dat vecs 10
> run
>
> projection modes evecs-ca.dat out pca-ca.dat beg 1 end 2 .CA
> run
>
> readdata pca-ca.dat
> runanalysis hist pca-ca.dat:2 pca-ca.dat:3 free 300 out fhist-all.CA.gnu
> bins 400
> ~~~~
>
> I am a bit confused with the settings that have to be used for PCA
> analysis. To my understanding, the first few PC are responsible for the
> majority of the variance of the system. Therefore, is there a reason why
> you calculate 174 PCs?

This is true, but it's often interesting to quantify *how much* of the

variance is accounted for by the first few PCs. This is easy enough to

do if you have all of the eigenvalues (or all of the non-negligible

eigenvalues). You can get a percentage of the variance accounted for by

the first N eigenvectors by dividing the sum of the first N eigenvalues

by the sum of all of them.
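To make that concrete, here is a minimal sketch of the cumulative-fraction calculation (not from the original thread; the eigenvalues below are made-up illustrative numbers -- real ones would come from the diagmatrix output):

```python
def explained_variance(eigenvalues, n):
    """Fraction of the total variance captured by the first n eigenvalues."""
    return sum(eigenvalues[:n]) / sum(eigenvalues)

# Illustrative eigenvalues; real values come from the cpptraj evecs file.
eigenvalues = [12.4, 5.1, 2.2, 0.9, 0.4]
for n in range(1, len(eigenvalues) + 1):
    print(f"first {n} PCs: {100 * explained_variance(eigenvalues, n):.1f}% of variance")
```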

> I thought that maybe only calculating 10 would be
> enough? Also, why do you set beg 1 end 3 in your projection function? If
> one wants to plot the 1st PC vs. the 2nd PC only beg 1 end 2 is needed I
> presume.

Yes, I think you would only need beg 1 end 2 if you had no interest in

looking at the third eigenvector. But it doesn't cost that much to

compute it and print it.

> Additionally, I am not quite sure how the last line of code works. I do
> not understand how changing the default bin size affects the
> calculation, should I just leave it as default (erase it from my script)
> if I want to analyze classical MD trajectory?

This is standard histogramming. The bin size is intimately related to
the populations of those bins. For instance, suppose your bin size is
WAY too small. In this case, the spacing between the bins is so small
that the chances of two points residing in the same bin are very low. So
all you have is a set of bins, some with an occupancy of 1 and the rest
with an occupancy of 0. This gives you no more information than a table
listing the points.

On the other hand, assume your whole grid was one large bin. Then every

point fits in that same bin, and again gives you no useful information.
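A toy illustration of both failure modes (my own sketch, not code from the thread; the data points are hypothetical integer values so the bin arithmetic is exact):

```python
from collections import Counter

def histogram(points, bin_width):
    """Map each point to a bin index and count the occupancy of each bin."""
    return Counter(p // bin_width for p in points)

points = [1, 2, 3, 17, 18, 35]  # made-up 1-D data

tiny = histogram(points, 1)    # every point sits alone: all occupancies are 1
huge = histogram(points, 100)  # one giant bin swallows every point
```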

There is a "sweet spot" where the bin size is large enough to give you a
reasonably smooth density (i.e., not too noisy), yet small enough to
resolve the meaningful features of the distribution. There are "rules"
for picking a bin width

automatically (see

http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width for a

list -- I'm most familiar with Scott's rule). But these are empirical

suggestions that tend to work well for certain types of data

distributions.
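As a sketch, Scott's rule is simple enough to compute by hand from the data; the helper names here are my own illustration, not cpptraj options:

```python
import math
import statistics

def scott_bin_width(data):
    """Scott's rule of thumb for histogram bin width: h = 3.49 * s * n**(-1/3),
    where s is the sample standard deviation and n the number of points."""
    return 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)

def scott_num_bins(data):
    """Number of bins implied by Scott's width over the data range."""
    return max(1, math.ceil((max(data) - min(data)) / scott_bin_width(data)))
```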

This bin width dependence is a reason why many people prefer a bin-free

alternative, like the kernel density estimator. In this case, each

point is "blurred" according to its kernel (often a Gaussian function),

so that the density you get is a continuous function of overlapping

kernels. The bin width analogy here is the bandwidth (how "wide" the

kernel is; for a Gaussian kernel, the standard deviation), and it
behaves in many respects like the histogram bin width. In my experience,
though, KDEs behave better than histograms for small-to-intermediate
amounts of data.
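A minimal pure-Python sketch of the idea (my own illustration, not cpptraj's KDE code): each data point contributes a Gaussian bump, and the density is the normalized sum of the bumps.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function: the average of Gaussian kernels, one
    centered on each data point, each with the given bandwidth (std. dev.)."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return density
```

Evaluating the returned function on a grid of x-values gives the smooth, continuous curve described above, with the bandwidth playing the role of the bin width.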

HTH,

Jason


--
Jason M. Swails
BioMaPS, Rutgers University
Postdoctoral Researcher

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Tue Mar 10 2015 - 07:30:03 PDT
