> I've been doing some clustering analysis on an RNA trajectory using
> ptraj (AmberTools 13). Because the trajectory is quite long, I need to
> use the sieving option, at least for some of the clustering algorithms.
>
> I've been running different algorithms and trying different cluster
> counts to compare the results by looking at the DBI and pSF statistics.
> However these values do not come out in the cluster.txt file because I'm
> using the sieve option. In the EndFirstPass file, the SSR/SST value is
> <0.6 in all cases. I guess it is so low because only a subset of frames
> are included in this pass, so I'm hoping that for the whole trajectory
> SSR/SST would be closer to 1.
I do not think the SSR/SST is low because you only have a subset of the
frames, but simply because clustering of dynamic biomolecules is sloppy.
While I applaud the effort to try to get the "best" clustering by using
metrics, I do not think these metrics will give you much statistically
significant discrimination among the various algorithms. If you go back
to the Shao et al. paper that discusses all of these metrics, essentially
we only got "good" metrics when we used artificial datasets, e.g. where we
built artificial trajectories that contained sampling around five very
distinct geometries (by aggregating five independent trajectories). In
the real world with a biomolecule like RNA that is dynamic and moving
continuously through related conformations, clustering is not so clear
cut; essentially each algorithm is making different choices about where
the boundaries between clusters lie, and it is difficult to state that
one algorithm is clearly better than another.
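For reference, SSR/SST and pSF have simple interpretations (these are the
standard definitions, essentially as used in the Shao et al. paper, not
something new from this thread; c = number of clusters, n = number of
frames, SSE = within-cluster sum of squares, SST = total sum of squares
about the global centroid, SSR = SST - SSE):

\[
  \mathrm{SSR/SST} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}},
  \qquad
  \mathrm{pSF} = \frac{\mathrm{SSR}/(c-1)}{\mathrm{SSE}/(n-c)}
\]

So SSR/SST only approaches 1 when nearly all of the variance lies between
clusters rather than within them, which you should not expect for a
biomolecule that interconverts continuously among related conformations.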
This is why with CPPTRAJ we haven't invested considerable efforts to get a
myriad of algorithms implemented and instead focused on speed.
Rather than focus on those metrics, I would suggest a more practical
approach. We normally look at 2D-RMSd plots to get an idea of how many
clusters there may be and then use something like averagelinkage. If
metrics are key, try clustering fewer frames (without sieving) and compare
runs with different offsets into the data.
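As a sketch (reusing the trajectory and atom mask from your input quoted
below; the output prefix, frame range, and offset here are made-up values
to adjust to your trajectory length):

trajin ../../../40C_1N8R1-15.1.dry.trj 1 10000 10
cluster out cluster40C.1_1N8R_avg averagelinkage clusters 15 rms mass :2-20

Repeating this with the trajin start shifted (e.g. "2 10000 10") then
shows how sensitive the metrics are to which frames were used.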
If you are concerned about the effects of sieving, try different start
frames (start #) or compare multiple runs with different random sievings
(random) and see what the influence is on the clustering. My guess is
that you will find similar clusters and representatives between the
different runs, and that the differences between different sieves will be
comparable to the differences from the choice of clustering algorithm,
i.e. there will be no clear way of distinguishing among the similar
algorithms (averagelinkage, means).
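For instance (a sketch using the random keyword mentioned above; sieveA
and sieveB are placeholder output names), do the same clustering twice in
separate ptraj runs so that a different random subset is picked each time:

cluster out sieveA means sieve 100 random clusters 15 rms mass :2-20
cluster out sieveB means sieve 100 random clusters 15 rms mass :2-20

then compare the average structures and cluster populations from the two
runs, remembering (see the p.s. below) that the cluster ordering may
differ between them.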
> I read in the AmberTools Manual that "The DBI and pSF values (SSR/SST
> too?) for a sieving algorithm can be calculated later by running the
> ptraj clustering again, using “DBI” as the algorithm. This will read the
> clustering result from the “filename.txt” and append the DBI and pSF
> values to the file “filename.txt”".
> prnlev 5
> trajin ../../../40C_1N8R1-15.1.dry.trj
> cluster out cluster40C.1_1N8R_means representative pdb \
> average pdb means sieve 100 clusters 15 rms mass :2-20
Using DBI as the "algorithm" means not specifying means, i.e. replacing
the means keyword with dbi, if that in fact still works... When you use
DBI as the algorithm, do not specify the sieve (or request the output of
structures):
cluster out cluster40C.1_1N8R_means dbi clusters 15 rms mass :2-20
Note that this will only work if you have fewer than 50,000 frames unless
you alter the code, so my earlier approach is likely preferable...
--tec3
p.s. Note that I just did a clustering with two different random sieves
and got essentially identical average structures, although the ordering
of the clusters may in fact differ...