Re: [AMBER] CPPTRAJ - KMeans Value for K

From: Christina Bergonzo <>
Date: Tue, 9 Feb 2021 08:45:22 -0500


I am not familiar with the 'elbow method' but I can tell you how I decide
my k values.

In Cpptraj, I first run kmeans clustering using the rms flag I like, and *save
the pair distance file*.
For this initial clustering I'll tell kmeans to assign 10 clusters.
I'll also run this clustering on a subset of my trajectory now, so it's

After this is done, I'll run another loop where I increment the number of
clusters I tell kmeans to assign, typically from 11 up through 20, and *load
the pair dist file* to make this go quickly.

By running the clustering many times with different numbers of clusters, I
get a good idea of both the Davies-Bouldin index and the pseudo-F statistic,
which I try to minimize and maximize, respectively (look at the manual for
specifics about what these mean - they are reported in the log file or info
file as 'DBI' and 'pSF').
I plot them on a scatter plot against the number of clusters, and choose the
k that best balances a low DBI with a high pSF.
I also check out the summary file, and if lots of structures/frames are in
the top 2 clusters, I might decrease the loop - so, re-process and set
kmeans clusters = 2 through 9.
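
To make the comparison across k less tedious, something like the following could gather the two metrics from the info files into one table for plotting. This is only a rough sketch: the function name collect_metrics is made up for illustration, and it assumes each info.<k>.dat file contains lines of the form "#DBI: <value>" and "#pSF: <value>". Check one of your own info files first and adjust the patterns if they differ.

```shell
#!/usr/bin/env bash
# Sketch: tabulate DBI and pSF for every info.<k>.dat in a directory.
# ASSUMPTION: the info files hold lines like "#DBI: 1.234" and "#pSF: 567.8".
# collect_metrics is a hypothetical helper, not part of cpptraj.

collect_metrics() {
  local dir="${1:-.}"
  local f k

  # Header row (printed outside the sorted pipeline so it stays on top).
  printf '%s\t%s\t%s\n' k DBI pSF

  for f in "$dir"/info.*.dat; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing

    # Recover k from the file name: info.12.dat -> 12
    k="${f##*/}"
    k="${k#info.}"
    k="${k%.dat}"

    # Pull the second field from the DBI and pSF lines; print NA if absent.
    awk -v k="$k" '
      /^#DBI:/ { dbi = $2 }
      /^#pSF:/ { psf = $2 }
      END {
        printf "%s\t%s\t%s\n", k, (dbi == "" ? "NA" : dbi), (psf == "" ? "NA" : psf)
      }
    ' "$f"
  done | sort -n              # order rows by numeric k
}
```

Running "collect_metrics . > metrics.tsv" in the directory holding the info files would give a tab-separated table with one row per k, which can be loaded into any plotting tool to look for a low DBI together with a high pSF.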

Then, I go back and cluster my entire trajectory using the number of
clusters I have determined.

A script might look like the following, of course with the rms/residues
changed to match your system:


cpptraj <<EOF
parm parmtop
trajin 1 last 100
rms fit :1-4,:12-15
cluster kmeans clusters 10 randompoint nofit kseed 821 rms mass :5-11&!@H= \
  sieve 2 random sieveseed 193 out cvt.10.dat summary summary.10.dat repout \
  rep repfmt pdb cpopvtime cpop.10.agr normframe savepairdist info info.10.dat
EOF

for ((i=11; i<=20; i++)) ; do
  echo DBG $i
  cpptraj <<EOF
parm parmtop
trajin 1 last 100
rms fit :1-4,:12-15
cluster kmeans clusters $i randompoint nofit kseed 821 rms mass :5-11&!@H= \
  sieve 2 random sieveseed 193 out cvt.$i.dat summary summary.$i.dat \
  cpopvtime cpop.$i.dat normframe loadpairdist info info.$i.dat
EOF
done

Hope this helps,

On Mon, Feb 8, 2021 at 1:06 PM Aanshi Gandhi <> wrote:

> Hello,
> I am using CPPTRAJ for clustering analysis on my trajectory files post
> simulation using the KMeans algorithm. However I am unsure of the best
> approach in deciding what number to denote for k. While doing some online
> reading, I came across the “Elbow method”, an algorithm
> that sorts through all the values of k and tells you the “best” value to
> use (based on the lowest amount of variation). When I tried using this in
> Python, it seemed as though it would only work on 2D arrays (whereas I
> believe the trajectory files are 3D).
> I was wondering if you had any suggestions of similar algorithms cpptraj
> might have to assist or any other options for me to try?
> Thank you for your help,
> Aanshi Gandhi (she/her)
> M.ASc Candidate, Garton Lab
> _______________________________________________
> AMBER mailing list

Christina Bergonzo
Research Chemist
Biomolecular Measurement Division, MML, NIST
Received on Tue Feb 09 2021 - 06:00:04 PST