Re: [AMBER] How to interpret ptraj clustering results

From: Thomas Cheatham III <tec3.utah.edu>
Date: Wed, 26 May 2010 11:54:57 -0600 (Mountain Daylight Time)

> After all 5 clusters's average structures were obtained, I also find a xxx.txt file
...
> Cluster 0: has 12 points, occurence 0.120, average-distance-to-centroid is 0.455100
> Cluster 1: has 12 points, occurence 0.120, average-distance-to-centroid is 0.484351
> Cluster 2: has 25 points, occurence 0.250, average-distance-to-centroid is 0.520670
> Cluster 3: has 29 points, occurence 0.290, average-distance-to-centroid is 0.536362
> Cluster 4: has 22 points, occurence 0.220, average-distance-to-centroid is 0.489196
>
> My question is if I want to check the statistical criteria such as sum
> of squares regression / total sum of squares. What should I do?

Which statistical criteria are you referring to here? If you read the Shao
et al. reference listed in the manual, you will see that some statistics
are reported with the clustering; however, they are not so straightforward
to interpret (both the statistics and that paper!).
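For reference, the ratio asked about above (sum of squares regression /
total sum of squares) can be computed by hand from cluster assignments. The
following is only a minimal sketch: it assumes each snapshot has already
been reduced to a point in a Euclidean space (for example, pre-aligned and
flattened coordinates), which is not the same as the RMSd metric ptraj
uses, and the function name ssr_over_sst is hypothetical. It is not
claimed to reproduce any number that ptraj prints.

```python
def ssr_over_sst(points, labels):
    """Return SSR/SST for points (equal-length tuples) and cluster labels.

    SST: squared distance of every point to the grand mean.
    SSR: squared distance of each cluster centroid to the grand mean,
         weighted by the number of points in that cluster.
    Assumes a Euclidean space; this is an illustration, not ptraj's code.
    """
    n = len(points)
    dim = len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]

    # Group the points by cluster label.
    members = {}
    for p, lab in zip(points, labels):
        members.setdefault(lab, []).append(p)

    sst = sum((p[d] - grand[d]) ** 2 for p in points for d in range(dim))

    ssr = 0.0
    for pts in members.values():
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        ssr += len(pts) * sum((centroid[d] - grand[d]) ** 2 for d in range(dim))

    return ssr / sst


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(ssr_over_sst(pts, [0, 0, 1, 1]))
```

A ratio close to 1 means most of the variance is between clusters (tight,
well-separated clusters); a ratio close to 0 means the clustering explains
very little.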

> How can I tell if the cluster0...cluster4 are similar or not?

What I often do is look at the 2D-RMSd plot, as this "shows" the
relationship among snapshots in the trajectory. Clearly
cluster0...cluster4 are somewhat different, otherwise they would be in the
same cluster. You can also look at the top of the xxx.txt file, which
supplies the distribution of pairwise RMSd values. In my case (below), ~90%
of the pairs of structures are within 2.9 angstroms of each other; looking
at the first entry, 0.11% of the pairs are < 0.516 angstroms apart and no
pair is closer than 0.251 angstroms.

###Distribution of Distances###
# [ 0.251, 0.516] -- 0.11% ( 16388 out of 14561106)
# [ 0.516, 0.781] -- 1.27% (185513 out of 14561106)
# [ 0.781, 1.046] -- 2.82% (410262 out of 14561106)
# [ 1.046, 1.311] -- 5.59% (813807 out of 14561106)
# [ 1.311, 1.575] -- 9.32% (1357031 out of 14561106)
# [ 1.575, 1.840] -- 12.59% (1833125 out of 14561106)
# [ 1.840, 2.105] -- 14.37% (2092506 out of 14561106)
# [ 2.105, 2.370] -- 14.80% (2154520 out of 14561106)
# [ 2.370, 2.634] -- 14.97% (2179192 out of 14561106)
# [ 2.634, 2.899] -- 13.75% (2001672 out of 14561106)
# [ 2.899, 3.164] -- 7.73% (1125944 out of 14561106)
# [ 3.164, 3.429] -- 2.07% (302046 out of 14561106)
# [ 3.429, 3.693] -- 0.43% ( 62583 out of 14561106)
# [ 3.693, 3.958] -- 0.12% ( 17957 out of 14561106)
# [ 3.958, 4.223] -- 0.04% ( 6127 out of 14561106)
# [ 4.223, 4.488] -- 0.01% ( 1883 out of 14561106)
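The "~90% within 2.9 angstroms" figure can be checked by accumulating the
histogram above. The sketch below parses the lines exactly as printed and
sums the bin counts up to a cutoff; it also backs out how many structures
produce 14561106 unique pairs via n(n-1)/2. The helper names
(parse_distribution, fraction_within, n_structures) are made up for this
example and are not part of ptraj.

```python
import math
import re

DIST = """\
# [ 0.251, 0.516] -- 0.11% ( 16388 out of 14561106)
# [ 0.516, 0.781] -- 1.27% (185513 out of 14561106)
# [ 0.781, 1.046] -- 2.82% (410262 out of 14561106)
# [ 1.046, 1.311] -- 5.59% (813807 out of 14561106)
# [ 1.311, 1.575] -- 9.32% (1357031 out of 14561106)
# [ 1.575, 1.840] -- 12.59% (1833125 out of 14561106)
# [ 1.840, 2.105] -- 14.37% (2092506 out of 14561106)
# [ 2.105, 2.370] -- 14.80% (2154520 out of 14561106)
# [ 2.370, 2.634] -- 14.97% (2179192 out of 14561106)
# [ 2.634, 2.899] -- 13.75% (2001672 out of 14561106)
# [ 2.899, 3.164] -- 7.73% (1125944 out of 14561106)
# [ 3.164, 3.429] -- 2.07% (302046 out of 14561106)
# [ 3.429, 3.693] -- 0.43% ( 62583 out of 14561106)
# [ 3.693, 3.958] -- 0.12% ( 17957 out of 14561106)
# [ 3.958, 4.223] -- 0.04% ( 6127 out of 14561106)
# [ 4.223, 4.488] -- 0.01% ( 1883 out of 14561106)
"""

LINE = re.compile(
    r"\[\s*([\d.]+),\s*([\d.]+)\]\s*--\s*[\d.]+%\s*\(\s*(\d+) out of (\d+)\)"
)


def parse_distribution(text):
    """Return a list of (low, high, count, total_pairs) tuples."""
    return [
        (float(m.group(1)), float(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in LINE.finditer(text)
    ]


def fraction_within(rows, cutoff):
    """Fraction of all pairs falling in bins whose upper edge is <= cutoff."""
    total_pairs = rows[0][3]
    inside = sum(count for _lo, hi, count, _tot in rows if hi <= cutoff + 1e-9)
    return inside / total_pairs


def n_structures(total_pairs):
    """Solve n*(n-1)/2 == total_pairs for n (assumes all unique pairs)."""
    return (1 + math.isqrt(1 + 8 * total_pairs)) // 2


if __name__ == "__main__":
    rows = parse_distribution(DIST)
    print(fraction_within(rows, 2.899))
    print(n_structures(rows[0][3]))
```

Note that the pair count corresponds to far fewer structures than the
number of points clustered further down, so the pairwise matrix was
presumably built on a subset of frames; that reading is an inference from
the numbers, not something the output states.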

Also, at the bottom of that xxx.txt file, there is information like the
following:

#Clustering: divide 269824 points into 10 clusters
#Cluster 0: has 854 points, occurence 0.003
#Cluster 1: has 15390 points, occurence 0.057
#Cluster 2: has 33886 points, occurence 0.126
#Cluster 3: has 108795 points, occurence 0.403
#Cluster 4: has 56659 points, occurence 0.210
#Cluster 5: has 25730 points, occurence 0.095
#Cluster 6: has 7106 points, occurence 0.026
#Cluster 7: has 16535 points, occurence 0.061
#Cluster 8: has 356 points, occurence 0.001
#Cluster 9: has 4513 points, occurence 0.017
#Cluster 0 1 . .
#Cluster 1 899. . . .. . .
#Cluster 2 .89...157892 ... .... .. 1..3 4.
#Cluster 3 1. ..1.2899559995772 48899931792 .4894
#Cluster 4 ..999842. . 26699994
#Cluster 5 . .. . .. .1. .. ..1..7X411.. 672.. . .
#Cluster 6 .. ... ..43...11 ...... . .....
#Cluster 7 ...4........1.1 ......... .4. 5X
#Cluster 8 . ..
#Cluster 9 ..... . . .33......

This shows the occupancy of each cluster (in my case, two clusters, 0 and
8, each represent less than 1% of the trajectory).
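To see quickly which clusters matter, the summary lines can be parsed and
ranked by population. This is a small sketch working on the text exactly as
printed above (including ptraj's spelling "occurence"); the function names
parse_occupancy and clusters_covering are hypothetical.

```python
import re

SUMMARY = """\
#Cluster 0: has 854 points, occurence 0.003
#Cluster 1: has 15390 points, occurence 0.057
#Cluster 2: has 33886 points, occurence 0.126
#Cluster 3: has 108795 points, occurence 0.403
#Cluster 4: has 56659 points, occurence 0.210
#Cluster 5: has 25730 points, occurence 0.095
#Cluster 6: has 7106 points, occurence 0.026
#Cluster 7: has 16535 points, occurence 0.061
#Cluster 8: has 356 points, occurence 0.001
#Cluster 9: has 4513 points, occurence 0.017
"""

PATTERN = re.compile(r"#Cluster\s+(\d+): has\s+(\d+) points, occurence")


def parse_occupancy(text):
    """Map cluster id -> number of points, read from the summary lines."""
    return {int(m.group(1)): int(m.group(2)) for m in PATTERN.finditer(text)}


def clusters_covering(counts, target):
    """Smallest set of most-populated clusters covering >= target fraction."""
    total = sum(counts.values())
    chosen, covered = [], 0
    for cid in sorted(counts, key=counts.get, reverse=True):
        chosen.append(cid)
        covered += counts[cid]
        if covered / total >= target:
            break
    return chosen


if __name__ == "__main__":
    counts = parse_occupancy(SUMMARY)
    total = sum(counts.values())
    print("clusters covering 80%:", clusters_covering(counts, 0.80))
    print("below 1%:", sorted(c for c, k in counts.items() if k / total < 0.01))
```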

A pseudo time-course of the snapshots in each cluster is shown in the
ASCII graphics. Time is on the x-axis, in intervals spanning the whole
trajectory, and a larger number means a higher occupancy of that cluster
during that part of the trajectory (with " " = 0, "." < 10% and X = 100%).
So looking at the above, cluster 0 is only for the very beginning of the
trajectory, with a small re-visit at about 20% through the trajectory,
cluster 1 has structures from the beginning and closer to the end of the
trajectory, etc. Cluster 3 is interesting since it only occurs in the
middle, and cluster 4 is visited twice. Looking at these results, I may
recluster to ~4-5 clusters instead of 10 since ~4-5 dominate.
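If you have the per-frame cluster assignments, a similar pseudo time-course
can be drawn yourself. The sketch below follows the symbol convention
described above (" " = 0, "." < 10%, a digit for the tens of percent, X =
100%); how ptraj actually bins the frames is an assumption here, and the
names time_course and symbol are hypothetical.

```python
def symbol(count, size):
    """One character for `count` frames out of `size` in a time window."""
    if size == 0 or count == 0:
        return " "
    if count == size:
        return "X"
    if count * 10 < size:          # below 10% occupancy
        return "."
    return str(count * 10 // size)  # tens digit of the percentage


def time_course(labels, n_clusters, n_bins):
    """Return one string per cluster; each character is one time window.

    labels: cluster id of every frame, in trajectory order.
    Windows are equal-width slices of the frame list (an assumption).
    """
    n = len(labels)
    rows = []
    for cid in range(n_clusters):
        chars = []
        for b in range(n_bins):
            window = labels[b * n // n_bins:(b + 1) * n // n_bins]
            chars.append(symbol(window.count(cid), len(window)))
        rows.append("".join(chars))
    return rows


if __name__ == "__main__":
    demo = [0] * 10 + [1] * 5 + [0] * 5
    for cid, row in enumerate(time_course(demo, n_clusters=2, n_bins=2)):
        print("#Cluster %d %s" % (cid, row))
```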

Are these converged (i.e. fully sampled)? Likely not; getting a good
correlation would require visiting each cluster (at least the highly
populated ones) ~10x. In practice, we often cannot reach this in straight
MD. The clustering above was for 270 ns of simulation on the backbone of
a fairly constrained cyclic peptide!
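The "~10 visits per cluster" rule of thumb can be checked roughly by
counting how many separate times the trajectory enters each cluster. A
minimal sketch, assuming you have the per-frame cluster ids in trajectory
order; the name visits_per_cluster and the min_run parameter (used to
ignore very short flickers) are inventions for this example.

```python
from itertools import groupby


def visits_per_cluster(labels, min_run=1):
    """Count contiguous runs of each cluster id along the trajectory.

    A run shorter than min_run frames is not counted as a visit.
    Note that one-frame flickers between neighbouring clusters inflate
    the count when min_run is 1.
    """
    visits = {}
    for cid, run in groupby(labels):
        if sum(1 for _ in run) >= min_run:
            visits[cid] = visits.get(cid, 0) + 1
    return visits


if __name__ == "__main__":
    demo = [0, 0, 1, 1, 0, 2, 0, 0]
    visits = visits_per_cluster(demo)
    print(visits)
    print("visited fewer than 10 times:",
          sorted(c for c, v in visits.items() if v < 10))
```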

> Based on the results that I got so far, should I used any one of the
> cluster's average structure to draw conclusions? It seems to me that no
> predominant cluster were found, all the occurences is 0.1 to 0.2.
> Should I expect to see only one cluster have the highest occurences if
> the system is equilibrated.

Maybe, maybe not; it depends on whether the structure is in conformational
exchange. There is no magic bullet or golden rule for analysis; it is your
job to try to make sense of the data and, importantly, to check whether the
data correlate with experiment. Normally I would consider clusters from the
later part of the trajectory more relevant; however, it depends on what you
are trying to learn, how good the force field is, etc. For example,
if the force field has issues, structures will move away from the
experimental structure and, although they may converge to a nice structure,
this may not be representative of the "true" structure. Look at the
structures in molecular graphics to see if differences are obvious and/or
reasonable. Run multiple simulations with different initial conditions
and see if similar patterns / clusters emerge. See if the analysis is
consistent over different blocks of the trajectory (or trajectories).
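One simple way to test consistency over blocks is to compare cluster
populations block by block. The following is a sketch under the assumption
that you have per-frame cluster ids and that n_blocks does not exceed the
number of frames; block_populations and max_population_shift are
hypothetical helper names.

```python
from collections import Counter


def block_populations(labels, n_blocks):
    """Fractional population of each cluster in equal-sized time blocks."""
    n = len(labels)
    result = []
    for b in range(n_blocks):
        block = labels[b * n // n_blocks:(b + 1) * n // n_blocks]
        counts = Counter(block)
        result.append({cid: k / len(block) for cid, k in counts.items()})
    return result


def max_population_shift(populations):
    """Largest change in any cluster's population between blocks.

    Values near 0 suggest the blocks agree; values near 1 mean some
    cluster dominates one block and is absent from another.
    """
    clusters = set().union(*populations)
    return max(
        max(p.get(c, 0.0) for p in populations)
        - min(p.get(c, 0.0) for p in populations)
        for c in clusters
    )


if __name__ == "__main__":
    drifting = [0, 0, 0, 0, 1, 1, 1, 1]
    mixing = [0, 0, 1, 1, 0, 0, 1, 1]
    print(max_population_shift(block_populations(drifting, 2)))
    print(max_population_shift(block_populations(mixing, 2)))
```

The same comparison can be made between independent simulations by
treating each run as one block.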

--tec3


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed May 26 2010 - 11:00:05 PDT