Hi,
There's really no one way to determine the number of clusters. The
input I posted to you before does the clustering for several different
cluster counts and leaves it to you to decide which one is optimal
based on your clustering "goodness" metrics (DBI, pseudo-F, etc.). You
can sometimes get a sense of how many clusters you might need by
looking at a 2D RMS plot, but that takes a bit of practice (although
I'm willing to bet Tom Cheatham could look at a 2D RMS plot and
instantly determine the optimal number of clusters...).
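If you want to try that, something like the input below should work
(topology, trajectory, and mask are taken from your earlier input with
the paths trimmed; the data set name, output file name, and frame
offset are just placeholders):

parm stripped.1va3_solv.prmtop
# use every 10th frame just to keep the 2D matrix a manageable size
trajin ensemble_nowat_100ns.nc 1 last 10
# pairwise frame-vs-frame RMSD over the CA atoms; the .gnu output
# can be plotted directly with gnuplot
rms2d To2D :1-29@CA out 2drms.gnu
run
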
-Dan
On Tue, Jun 6, 2023 at 6:38 AM Daniel Hall <D.Hall1.bradford.ac.uk> wrote:
>
> Hello Daniel,
>
> Sorry for the late response. I have been playing with the cluster parameters.
>
> I don't fully understand how to calculate the cluster size shown in the clustering example "Clustering a protein from multiple independent copies".
>
> How do you generate that data?
>
> Regards,
> Dan.
> ________________________________
> From: Daniel Roe <daniel.r.roe.gmail.com>
> Sent: 18 May 2023 16:03
> To: Daniel Hall <D.Hall1.bradford.ac.uk>; AMBER Mailing List <amber.ambermd.org>
> Subject: Re: [AMBER] Cluster size
>
> Hi,
>
> So I probably sound like a broken record at this point, but I'll say
> it again: clustering is very much an art form, requiring a lot of
> trial and error before you get "optimal" results. In general, you want
> to pay attention to your clustering success metrics which get printed
> in the 'info' file, namely the Davies-Bouldin index, pseudo-F, and the
> SSR/SST ratio, all of which are discussed in the manual entry for the
> 'cluster' command. What I like to do is take a reduced set of my
> trajectory (e.g. every 10th frame or so - pick a number so the
> clustering finishes in a reasonably short amount of time), then run
> cluster analysis while adjusting the clustering parameters until I'm
> satisfied with the values of my clustering success metrics. Saving and
> loading the pairwise distance matrix with the 'savepairdist' and
> 'loadpairdist' keywords will speed this up by caching it in a file
> (by default named CpptrajPairDist). Some sample input might look
> something like this:
>
> parm tz2.parm7
> trajin tz2.nc
> for N=2;N<7;N++
>   cluster C$N kmeans clusters $N :1-12@N,CA,C,O out clusters.$N.dat info info.$N.dat \
>     pairdist MyDist savepairdist loadpairdist
> done
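>
> If you only want every 10th frame or so as mentioned above, the
> optional start/stop/offset arguments to 'trajin' take care of that,
> e.g.:
>
> trajin tz2.nc 1 last 10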
>
> You can then examine the DBI and pseudo-F values in the info.X.dat
> files and decide which # of clusters is "optimal" for your
> system/metric.
>
> Note you need an up-to-date version of cpptraj (AmberTools 23 or
> GitHub, https://github.com/Amber-MD/cpptraj) to ensure the pairwise
> data set caching will work properly between subsequent runs.
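>
> For example (just a sketch, using the same tz2 files and arbitrary
> data set names), if you split the work into separate cpptraj
> invocations, the first run writes CpptrajPairDist and later runs
> simply read it back:
>
> # run 1: compute the pairwise distance matrix and cache it to disk
> parm tz2.parm7
> trajin tz2.nc
> cluster C2 kmeans clusters 2 :1-12@N,CA,C,O info info.2.dat \
>   pairdist MyDist savepairdist
> run
>
> # run 2 (separate invocation): load the cached matrix instead of
> # recalculating it
> parm tz2.parm7
> trajin tz2.nc
> cluster C3 kmeans clusters 3 :1-12@N,CA,C,O info info.3.dat \
>   pairdist MyDist loadpairdist
> run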
>
> Hope this helps,
>
> -Dan
>
> On Mon, May 15, 2023 at 10:43 AM Daniel Hall via AMBER
> <amber.ambermd.org> wrote:
> >
> > Hello amber users!
> >
> > I am trying to perform clustering analysis using AMBER18 by following this tutorial:
> > https://amberhub.chpc.utah.edu/clustering-a-protein-trajectory/
> >
> > But I am having difficulty determining the cluster size. Could anyone advise how to do this?
> >
> > My “analysis.in” file is:
> > ###
> > parm ../../../stripped.1va3_solv.prmtop
> > trajin ../../../ensemble_nowat_100ns.nc
> > cluster c1 \
> > kmeans clusters 100 randompoint maxit 500 \
> > rms :1-29@CA \
> > sieve 10 random \
> > out cnumvtime.dat \
> > summary summary.dat \
> > info info.dat \
> > cpopvtime cpopvtime.agr normframe \
> > repout rep repfmt pdb \
> > singlerepout singlerep.nc singlerepfmt netcdf \
> > avgout avg avgfmt pdb
> > run
> > ###
> >
> > Any help would be greatly appreciated! 😊
> >
> > Kind regards,
> > Daniel.
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber