So I probably sound like a broken record at this point, but I'll say
it again: clustering is very much an art form, requiring a lot of
trial and error before you get "optimal" results. In general, you want
to pay attention to the clustering success metrics that get printed
in the 'info' file, namely the Davies-Bouldin index (DBI), pseudo-F,
and the SSR/SST ratio, all of which are discussed in the manual entry
for the 'cluster' command. What I like to do is take a reduced set of
my trajectory (e.g. every 10th frame or so; pick a number so the
clustering finishes in a reasonably short amount of time), then run
cluster analysis while adjusting the clustering parameters until I'm
satisfied with the values of those metrics. The 'savepairdist' and
'loadpairdist' keywords speed this up by caching the pairwise
distance matrix in a file (by default named CpptrajPairDist), so it
does not have to be recalculated on every run. Sample input might
look something like:
parm tz2.parm7
trajin tz2.nc
for N=2;N<7;N++
  cluster C$N kmeans :1-12@N,CA,C,O out clusters $N info info.$N.dat \
    pairdist MyDist savepairdist loadpairdist
done
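You can then examine the DBI and pseudo-F values in the info.X.dat
files (where X is the number of clusters) and decide which number of
clusters is "optimal". As a rule of thumb, lower DBI and higher
pseudo-F indicate better clustering.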
Note you need an up-to-date version of cpptraj (AmberTools 23 or
https://github.com/Amber-MD/cpptraj) to ensure the pairwise
data set caching will work properly between subsequent runs.
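If you want the metrics from all the info files side by side, here is
a minimal Python sketch. It assumes the info files contain lines
beginning with '#DBI:', '#pSF:' and '#SSR/SST:'; check your own files
and adjust the labels if they differ. The names parse_metrics and
collect_metrics are made up for illustration and are not part of
cpptraj.

```python
import glob
import re

# Labels assumed to prefix the metric lines in a cpptraj cluster 'info' file.
# Check your own info files and adjust these if they differ.
METRIC_LABELS = ("DBI", "pSF", "SSR/SST")


def parse_metrics(text):
    """Return {label: value} for each assumed metric line found in text."""
    metrics = {}
    for line in text.splitlines():
        for label in METRIC_LABELS:
            # Match e.g. "#DBI: 0.75" at the start of a line.
            match = re.match(r"#\s*" + re.escape(label) + r":\s*(\S+)", line)
            if match:
                metrics[label] = float(match.group(1))
    return metrics


def collect_metrics(pattern="info.*.dat"):
    """Parse every file matching pattern; return {filename: metrics}."""
    results = {}
    for name in sorted(glob.glob(pattern)):
        with open(name) as handle:
            results[name] = parse_metrics(handle.read())
    return results


if __name__ == "__main__":
    # Print one line per info file so the runs can be compared by eye.
    for name, metrics in collect_metrics().items():
        print(name, metrics)
```

Run it in the directory holding the info.N.dat files written by the
loop above and compare the printed values across the different
cluster counts.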
Hope this helps,
On Mon, May 15, 2023 at 10:43 AM Daniel Hall via AMBER
<amber.ambermd.org> wrote:
> Hello amber users!
> I am trying to perform clustering analysis using AMBER18 by following this tutorial:
> https://amberhub.chpc.utah.edu/clustering-a-protein-trajectory/
> But I am having difficulty determining the cluster size, could anyone advise how to do this.
> My “analysis.in” file is:
> ###
> parm ../../../stripped.1va3_solv.prmtop
> trajin ../../../ensemble_nowat_100ns.nc
> cluster c1 \
> kmeans clusters 100 randompoint maxit 500 \
> rms :1-29@CA \
> sieve 10 random \
> out cnumvtime.dat \
> summary summary.dat \
> info info.dat \
> cpopvtime cpopvtime.agr normframe \
> repout rep repfmt pdb \
> singlerepout singlerep.nc singlerepfmt netcdf \
> avgout avg avgfmt pdb
> run
> ###
> Any help would be greatly appreciated! 😊
> Kind regards,
> Daniel.
