# Re: [AMBER] How to interpret clustering

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Wed, 1 Dec 2021 08:39:23 -0500

Hi,

Welcome to the wild world of cluster analysis, where everything's made
up and the points don't matter. (Sorry for the "Whose Line Is It
Anyway" reference, I couldn't resist:
https://en.wikipedia.org/wiki/Whose_Line_Is_It_Anyway%3F)

First off, if you haven't already, I recommend reading the
classic Shao, Cheatham et al. paper on clustering MD simulation data:
https://pubs.acs.org/doi/10.1021/ct700119m

Second, to answer your question: since you've requested that the
cluster population vs time (frame) plots be normalized by frame (via
'normframe'), each data point at frame X is just:

ClusterNpopVsTime(X) = PopulationOfClusterN(X) / X

where PopulationOfClusterN(X) is the number of frames from 1 through X
that have been assigned to cluster N.

The reason the population of cluster 9 starts at 1 is that the first
frame is in cluster 9. After about 50 frames or so it looks like (the
colors are a bit washed out in the picture) the structure falls into
cluster 7, so the population of that cluster starts to rise and the
population of 9 starts to fall. As the simulation continues the
structure falls into cluster 2, then 5, and so on. By the end of the
simulation the final values of the cluster population vs time plot
should match up with the final cluster population values from the
'summary' file. Cpptraj sorts clusters by population, so cluster 0 is
just the cluster that ends up having the most structures assigned to
it (the cluster 'centroid' is a much different concept and is unique
to each cluster).
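
To make the normalization concrete, here is a minimal Python sketch of the
formula above. It is not cpptraj's actual code; the function name and the
example assignments are made up for illustration, and it assumes every frame
is assigned to exactly one cluster numbered from 0.

```python
def normalized_population_vs_time(cluster_per_frame, n_clusters):
    """Return pop, where pop[n][i] is the fraction of frames 1..i+1
    that have been assigned to cluster n (the 'normframe' idea)."""
    counts = [0] * n_clusters              # running population of each cluster
    pop = [[] for _ in range(n_clusters)]  # one curve per cluster
    for frame, cluster in enumerate(cluster_per_frame, start=1):
        counts[cluster] += 1
        for n in range(n_clusters):
            pop[n].append(counts[n] / frame)
    return pop


# Hypothetical cluster number vs frame: the trajectory starts in
# cluster 2, then spends most of its time in cluster 0.
assignments = [2, 2, 0, 0, 0, 1, 0, 0]
pop = normalized_population_vs_time(assignments, n_clusters=3)

print(pop[2][0])   # first frame is in cluster 2, so its curve starts at 1.0
print(pop[2][-1])  # final value: 2 of 8 frames = 0.25
print(pop[0][-1])  # final value: 5 of 8 frames = 0.625
```

In this toy example the curve for cluster 2 starts at 1 and decays, just like
cluster 9 in your plot, while cluster 0 ends with the largest final value
because it holds the most frames overall.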

Hope this helps,

-Dan

On Fri, Nov 26, 2021 at 11:16 AM Sadaf Rani <sadafrani6.gmail.com> wrote:
>
> Dear Amber users
> I have performed two independent trajectory runs of a protein-ligand
> complex for 100 ns. To check convergence, I performed cluster analysis on
> the two trajectories, each consisting of 10000 frames, using the k-means
> clustering algorithm as shown in the tutorial
> https://amberhub.chpc.utah.edu/clustering-a-protein-trajectory/
> I got the files cnumvtime and cpopvtime along with singlerep.nc by using
> the following commands in cpptraj:
> parm complex_wild_nowat.prmtop
> traj Dup1_combi_nowat.nc 1 last 10
> trajin Dup1_combi_nowat.nc 1 last 10
> trajin Heat7_nowat.nc 1 last 10
> strip :Na+
> cluster c1 kmeans clusters 10 randompoint maxit 500 rms :1-492@C,N,O,CA,CB&!@H=
> sieve 10 random out cnumvtime.dat summary summary.dat info info.dat
> cpopvtime cpopvtime.agr normframe repout rep repfmt pdb singlerepout
> singlerep.nc singlerepfmt netcdf avgout avg avgfmt pdb
> run
>
> I am new to cluster analysis and am finding the results difficult to
> interpret. I need your kind suggestions to understand them. From cpopvtime
> I can see 10 clusters of the structure, in which pop:9 shows a broader area
> under the curve as compared to Pop=0. Does it mean that Pop=0 is the most
> centroid cluster?
> How can I use this information to analyze protein-ligand interaction?
> Thanks in advance.
> Regards