RE: [AMBER] How to interpret ptraj clustering results from Catein Catherine on 2010-05-27 (Amber Archive May 2010)

From: Catein Catherine <askamber23.hotmail.com>
Date: Thu, 27 May 2010 15:40:24 +0800

Dear Professor.

Yes. I understand much better now.

Thank you very much.

Best regards,

Cat

> Date: Wed, 26 May 2010 11:54:57 -0600
> From: tec3.utah.edu
> To: amber.ambermd.org
> Subject: Re: [AMBER] How to interpret ptraj clustering results
>
>
> > After the average structures of all 5 clusters were obtained, I also found a xxx.txt file
> ...
> > Cluster 0: has 12 points, occurence 0.120, average-distance-to-centroid is 0.455100
> > Cluster 1: has 12 points, occurence 0.120, average-distance-to-centroid is 0.484351
> > Cluster 2: has 25 points, occurence 0.250, average-distance-to-centroid is 0.520670
> > Cluster 3: has 29 points, occurence 0.290, average-distance-to-centroid is 0.536362
> > Cluster 4: has 22 points, occurence 0.220, average-distance-to-centroid is 0.489196
> >
> > My question is: if I want to check statistical criteria such as the sum
> > of squares regression / total sum of squares, what should I do?
>
> What statistical criteria are you referring to here? If you read the Shao
> et al. reference listed in the manual, some statistics are reported with
> the clustering; however, they are not so straightforward to interpret
> (both the statistics and that paper!).
>
> > How can I tell if the cluster0...cluster4 are similar or not?
>
> What I often do is look at the 2D-RMSd plot, as this "shows" the
> relationship among snapshots in the trajectory. Clearly
> cluster0...cluster4 are somewhat different, otherwise they would be in the
> same cluster. You can look at the top of the xxx.txt file, which supplies
> the distribution of pairwise RMSd values; in my case (below), ~90% of the
> pairs of structures are within 2.9 angstroms of each other (i.e. looking
> at the first entry, 0.11% of the pairs are < 0.516 angstroms apart and
> none is closer than 0.251 angstroms).
>
> ###Distribution of Distances###
> # [ 0.251, 0.516] -- 0.11% ( 16388 out of 14561106)
> # [ 0.516, 0.781] -- 1.27% (185513 out of 14561106)
> # [ 0.781, 1.046] -- 2.82% (410262 out of 14561106)
> # [ 1.046, 1.311] -- 5.59% (813807 out of 14561106)
> # [ 1.311, 1.575] -- 9.32% (1357031 out of 14561106)
> # [ 1.575, 1.840] -- 12.59% (1833125 out of 14561106)
> # [ 1.840, 2.105] -- 14.37% (2092506 out of 14561106)
> # [ 2.105, 2.370] -- 14.80% (2154520 out of 14561106)
> # [ 2.370, 2.634] -- 14.97% (2179192 out of 14561106)
> # [ 2.634, 2.899] -- 13.75% (2001672 out of 14561106)
> # [ 2.899, 3.164] -- 7.73% (1125944 out of 14561106)
> # [ 3.164, 3.429] -- 2.07% (302046 out of 14561106)
> # [ 3.429, 3.693] -- 0.43% ( 62583 out of 14561106)
> # [ 3.693, 3.958] -- 0.12% ( 17957 out of 14561106)
> # [ 3.958, 4.223] -- 0.04% ( 6127 out of 14561106)
> # [ 4.223, 4.488] -- 0.01% ( 1883 out of 14561106)
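
A minimal sketch of how the "~90% within 2.9 angstroms" figure follows from the histogram above: add up the per-bin pair counts and divide by the total number of pairs. The histogram text is copied from this message; the regular expression and the function name `cumulative_fractions` are illustrative assumptions, not part of ptraj.

```python
import re

# Histogram lines copied from the "Distribution of Distances" block above.
HISTOGRAM = """\
# [ 0.251, 0.516] -- 0.11% ( 16388 out of 14561106)
# [ 0.516, 0.781] -- 1.27% (185513 out of 14561106)
# [ 0.781, 1.046] -- 2.82% (410262 out of 14561106)
# [ 1.046, 1.311] -- 5.59% (813807 out of 14561106)
# [ 1.311, 1.575] -- 9.32% (1357031 out of 14561106)
# [ 1.575, 1.840] -- 12.59% (1833125 out of 14561106)
# [ 1.840, 2.105] -- 14.37% (2092506 out of 14561106)
# [ 2.105, 2.370] -- 14.80% (2154520 out of 14561106)
# [ 2.370, 2.634] -- 14.97% (2179192 out of 14561106)
# [ 2.634, 2.899] -- 13.75% (2001672 out of 14561106)
# [ 2.899, 3.164] -- 7.73% (1125944 out of 14561106)
# [ 3.164, 3.429] -- 2.07% (302046 out of 14561106)
# [ 3.429, 3.693] -- 0.43% ( 62583 out of 14561106)
# [ 3.693, 3.958] -- 0.12% ( 17957 out of 14561106)
# [ 3.958, 4.223] -- 0.04% ( 6127 out of 14561106)
# [ 4.223, 4.488] -- 0.01% ( 1883 out of 14561106)
"""

# Captures: lower edge, upper edge, pair count in the bin, total pair count.
BIN_LINE = re.compile(
    r"\[\s*([\d.]+),\s*([\d.]+)\]\s*--\s*[\d.]+%\s*\(\s*(\d+) out of (\d+)\)"
)


def cumulative_fractions(text):
    """Return a list of (upper_bin_edge, cumulative_fraction_of_pairs)."""
    rows = []
    running = 0
    for line in text.splitlines():
        match = BIN_LINE.search(line)
        if match is None:
            continue  # skip anything that is not a histogram row
        upper_edge = float(match.group(2))
        running += int(match.group(3))
        rows.append((upper_edge, running / int(match.group(4))))
    return rows


rows = cumulative_fractions(HISTOGRAM)
for upper_edge, fraction in rows:
    print(f"<= {upper_edge:5.3f} A: {100 * fraction:6.2f}% of pairs")
```

For the numbers above, the cumulative fraction at the 2.899 angstrom bin edge comes out at roughly 89.6% of all pairs.
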
>
> Also, at the bottom of that xxx.txt file, there is information like the
> following:
>
> #Clustering: divide 269824 points into 10 clusters
> #Cluster 0: has 854 points, occurence 0.003
> #Cluster 1: has 15390 points, occurence 0.057
> #Cluster 2: has 33886 points, occurence 0.126
> #Cluster 3: has 108795 points, occurence 0.403
> #Cluster 4: has 56659 points, occurence 0.210
> #Cluster 5: has 25730 points, occurence 0.095
> #Cluster 6: has 7106 points, occurence 0.026
> #Cluster 7: has 16535 points, occurence 0.061
> #Cluster 8: has 356 points, occurence 0.001
> #Cluster 9: has 4513 points, occurence 0.017
> #Cluster 0 1 . .
> #Cluster 1 899. . . .. . .
> #Cluster 2 .89...157892 ... .... .. 1..3 4.
> #Cluster 3 1. ..1.2899559995772 48899931792 .4894
> #Cluster 4 ..999842. . 26699994
> #Cluster 5 . .. . .. .1. .. ..1..7X411.. 672.. . .
> #Cluster 6 .. ... ..43...11 ...... . .....
> #Cluster 7 ...4........1.1 ......... .4. 5X
> #Cluster 8 . ..
> #Cluster 9 ..... . . .33......
>
> This shows the occupancy of each cluster (in my case, clusters 0 and 8
> each represent less than 1% of the trajectory).
>
> A pseudo time-course of the snapshots in each cluster is shown in the
> ascii graphics. A larger number (with " " = 0, "." < 10% and X = 100%)
> means higher occupancy during that part of the trajectory (time is on the
> x-axis in intervals spanning the whole trajectory). So looking
> at the above, cluster 0 is only for the very beginning of the trajectory,
> with a small re-visit at about 20% through the trajectory, cluster 1 has
> structures from the beginning and closer to the end of the trajectory,
> etc. Cluster 3 is interesting since it only occurs in the middle, and
> cluster 4 is visited twice. Looking at these results, I may recluster to
> ~4-5 clusters instead of 10 since ~4-5 dominate.
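
One way to put a number on "~4-5 dominate" is to sort the clusters by population and see how many are needed to cover most of the trajectory. The sketch below uses the summary lines quoted above; the 80% coverage threshold and the function names are arbitrary assumptions chosen for illustration, not a ptraj feature or a recommended cutoff.

```python
import re

# Cluster summary lines copied from the output quoted above
# (the spelling "occurence" is as printed by the program).
SUMMARY = """\
#Cluster 0: has 854 points, occurence 0.003
#Cluster 1: has 15390 points, occurence 0.057
#Cluster 2: has 33886 points, occurence 0.126
#Cluster 3: has 108795 points, occurence 0.403
#Cluster 4: has 56659 points, occurence 0.210
#Cluster 5: has 25730 points, occurence 0.095
#Cluster 6: has 7106 points, occurence 0.026
#Cluster 7: has 16535 points, occurence 0.061
#Cluster 8: has 356 points, occurence 0.001
#Cluster 9: has 4513 points, occurence 0.017
"""

CLUSTER_LINE = re.compile(r"#Cluster\s+(\d+): has\s+(\d+) points")


def populations(text):
    """Map cluster id -> number of points, parsed from the summary lines."""
    return {int(m.group(1)): int(m.group(2)) for m in CLUSTER_LINE.finditer(text)}


def dominant_clusters(pops, coverage=0.80):
    """Pick the largest clusters until they cover `coverage` of all points."""
    total = sum(pops.values())
    chosen, covered = [], 0
    for cluster_id, n_points in sorted(pops.items(), key=lambda kv: kv[1], reverse=True):
        if covered / total >= coverage:
            break
        chosen.append(cluster_id)
        covered += n_points
    return chosen, covered / total


pops = populations(SUMMARY)
chosen, fraction = dominant_clusters(pops)
print(f"clusters {chosen} cover {100 * fraction:.1f}% of {sum(pops.values())} points")
```

With the 80% threshold this selects clusters 3, 4, 2 and 5, which together hold about 83% of the points; adding cluster 7 brings the total to about 89.5%.
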
>
> Are these converged (i.e. fully sampled)? Likely not; to get a good
> correlation would require visiting each cluster (at least the highly
> populated ones) ~10x. In practice, we often cannot reach this in straight
> MD. The clustering above was for 270 ns of simulation on the backbone of
> a fairly constrained cyclic peptide!
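
The "visit each cluster ~10x" rule of thumb can be checked by counting how many separate times the trajectory enters each cluster. This is only a sketch: it assumes you already have a per-frame list of cluster ids (the `frames` list below is made-up toy data, and how you extract such a list from your own clustering output is not shown here).

```python
from itertools import groupby


def count_visits(assignments):
    """Count separate visits (contiguous runs of frames) to each cluster."""
    visits = {}
    # groupby yields one group per run of identical consecutive cluster ids.
    for cluster_id, _run in groupby(assignments):
        visits[cluster_id] = visits.get(cluster_id, 0) + 1
    return visits


# Toy per-frame cluster ids (hypothetical data, not from the trajectory above).
frames = [0, 0, 1, 1, 1, 0, 2, 2, 1, 1]
visits = count_visits(frames)
print(visits)
```

A cluster with a large population but only one or two visits has been sampled as a single long stay, which says little about how often the system returns to it.
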
>
> > Based on the results that I have so far, should I use any one of the
> > clusters' average structures to draw conclusions? It seems to me that no
> > predominant cluster was found; all the occurrences are roughly 0.1 to 0.3.
> > Should I expect to see only one cluster with the highest occurrence if
> > the system is equilibrated?
>
> Maybe, maybe not; it depends on whether the structure is in conformational
> exchange. There is no magic bullet or golden rule for analysis; it is your
> job to try to make sense of the data and, importantly, to see whether the
> data correlate with experiment. Normally I would consider clusters from the
> later part of the trajectory more relevant, however it depends on what you
> are trying to learn, how good the force field is, etc, etc. For example,
> if the force field has issues, structures will move away from the
> experimental structure and although they may converge to a nice structure,
> this may not be representative of the "true" structure. Look at the
> structures in molecular graphics to see if differences are obvious and/or
> reasonable. Run multiple simulations with different initial conditions
> and see if similar patterns / clusters emerge... See if the analysis is
> consistent over different blocks of the trajectory (or trajectories).
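
A minimal sketch of the block check suggested above: split a per-frame list of cluster ids into consecutive blocks and compare the cluster populations in each block. The input list is hypothetical toy data and the function name `block_populations` is an assumption for illustration; it is not something ptraj provides.

```python
from collections import Counter


def block_populations(assignments, n_blocks):
    """Return one {cluster_id: fraction} dict per consecutive block of frames."""
    if n_blocks < 1 or n_blocks > len(assignments):
        raise ValueError("n_blocks must be between 1 and the number of frames")
    size = len(assignments) // n_blocks
    result = []
    for b in range(n_blocks):
        start = b * size
        # The last block absorbs any leftover frames.
        end = len(assignments) if b == n_blocks - 1 else start + size
        block = assignments[start:end]
        counts = Counter(block)
        result.append({cid: counts[cid] / len(block) for cid in sorted(counts)})
    return result


# Toy per-frame cluster ids (hypothetical data).
frames = [0, 0, 0, 1, 1, 1, 2, 2]
blocks = block_populations(frames, n_blocks=2)
for index, fractions in enumerate(blocks):
    print(f"block {index}: {fractions}")
```

If the populations differ strongly between blocks, as in this toy example, the clustering is describing a drift along the trajectory rather than a stable ensemble.
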
>
> --tec3
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu May 27 2010 - 01:00:06 PDT