Re: [AMBER] Trouble understanding DBSCAN clustering algorithm from Daniel Roe on 2015-05-07 (Amber Archive May 2015)

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Thu, 7 May 2015 10:12:11 -0600

Hi,

On Thu, May 7, 2015 at 3:29 AM, Juan Eiros Zamora
<j.eiros-zamora14.imperial.ac.uk> wrote:
> I have read the cluster documentation in the manual but I'm failing to
> understand how the pairwise distance matrix works for the clustering. In
> one of the examples, the clustering is done using the "distance" command
> between two residues, and then clustering based on that distance (if I'm
> not mistaken) as such:

There are two "distances" in the example: the geometric distance
measured by the 'distance' command, and the distance used for
clustering. In the context of clustering, "distances" are just
measures of how similar one data point is to another. The distance
(i.e. similarity) metric can be either coordinate-based (e.g. RMSD,
DME) or based on data derived from coordinate frames (e.g. geometric
distance, radius of gyration, etc.).

> Example: cluster on a specific distance:
> distance endToEnd :1 :255
> cluster data endToEnd clusters 10 epsilon 3.0 summary summary.dat info info.dat

In this case the clustering distance metric is 'data', which just
means "use the provided data set". So say in the given example we're
clustering on 4 frames. The data set 'endToEnd' will contain the
through-space distance between residues 1 and 255 for each frame,
which may look something like:

20.0
21.4
21.1
20.3

Then for the purposes of clustering, the distance between frames is
taken as the Euclidean distance between their data set values (for a
single 1-D data set this is just the absolute difference). So the
"distance" for clustering between frames 1 and 2 would be
|21.4 - 20.0| = 1.4, etc. If your metric were RMSD, then you would be
calculating the RMSD between frames 1 and 2.
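
To make the arithmetic concrete, here is a minimal Python sketch of how
a pairwise distance matrix could be built from the four example values
above. The function name pairwise_distances is made up for
illustration; this is not cpptraj code, just the same calculation
written out.

```python
# Per-frame values of the 'endToEnd' data set from the example above.
end_to_end = [20.0, 21.4, 21.1, 20.3]


def pairwise_distances(values):
    """Return the symmetric matrix of |v_i - v_j| for a 1-D data set."""
    n = len(values)
    return [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]


matrix = pairwise_distances(end_to_end)

# Frames are numbered from 1 in the text, so frames 1 and 2 are
# list indices 0 and 1 here.
print(round(matrix[0][1], 1))  # 1.4
```

The matrix is symmetric with zeros on the diagonal, so only the upper
triangle carries information.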

> If I want to cluster only a specific region of my system (residues 232
> to 248, for instance), I should not follow the example above right? So,
> the first time the command would be something like this:
>
> cluster dbscan minpoints 4 epsilon 3.5 rms savepairdist pardist matrixfile

Well, if you only want to cluster on RMSD of residues 232 to 248 you
need to put that atom mask in there (otherwise you will use all
atoms). Also 'pardist' should be 'pairdist', so:

cluster dbscan minpoints 4 epsilon 3.5 rms :232-248 savepairdist pairdist matrixfile
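
As a rough illustration of why the mask matters, the Python sketch
below computes an RMSD over a chosen subset of atoms only. It rests on
several assumptions: the coordinates are invented, the helper
rmsd_no_fit is hypothetical, and the superposition (best-fit) step
that an RMSD metric usually involves is skipped to keep it short.

```python
import math


def rmsd_no_fit(frame_a, frame_b, selected):
    """RMSD over the selected atom indices only, without superposition."""
    total = 0.0
    for i in selected:
        total += sum((a - b) ** 2 for a, b in zip(frame_a[i], frame_b[i]))
    return math.sqrt(total / len(selected))


# Hypothetical 4-atom frames (x, y, z); not taken from any real trajectory.
frame1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
frame2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0)]

all_atoms = [0, 1, 2, 3]
masked = [0, 1, 2]  # stands in for an atom mask that excludes the last atom

print(rmsd_no_fit(frame1, frame2, all_atoms))  # 2.0
print(rmsd_no_fit(frame1, frame2, masked))     # 0.0
```

Here the only atom that moves lies outside the selection, so the two
frames look identical to the masked metric but different to the
all-atom one.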

> And then for the subsequent clustering trials I should use
>
> cluster dbscan minpoints 4 epsilon X rms loadpairdist matrixfile #Change

The 'pairdist' keyword specifies the pairwise distance matrix file
name, so you want:

cluster dbscan minpoints 4 epsilon X rms :232-248 loadpairdist pairdist matrixfile

> Does the pairwise matrix that is saved vary based on the cluster
> algorithm that is used? If not, could I use the same pairwise distance
> matrix to try out different algorithms?

The pairwise distance matrix depends only on the clustering metric and
the input frames, not on the clustering algorithm. So as long as your
input frames and distance metric are the same, you can re-use the
pairwise distance matrix file. This is in fact why cpptraj does not
automatically save the pairwise distance matrix file: if you re-use it
with a different metric (or different input frames) you will get
incorrect results. cpptraj does what checking it can to ensure the
pairwise distance file is OK, but it doesn't catch everything.
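
To illustrate the re-use, here is a small Python sketch that builds a
pairwise distance matrix once and then runs a textbook-style DBSCAN on
it with two different epsilon values. This is not cpptraj's
implementation: the data values are invented, the function dbscan is
written only for this example, and counting a point as its own
neighbor toward min_points is a choice made here, not a statement
about how cpptraj counts.

```python
def dbscan(dist, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(dist)
    # Each neighborhood includes the point itself (an assumption of this sketch).
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_points:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_points:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


# Hypothetical per-frame data values; the matrix is computed only once.
values = [20.0, 20.1, 20.2, 25.0, 25.1, 25.2, 30.0]
dist = [[abs(a - b) for b in values] for a in values]

# Same matrix, different epsilon: nothing has to be recomputed.
tight = dbscan(dist, eps=0.25, min_points=3)
loose = dbscan(dist, eps=6.0, min_points=3)
print(tight)  # [0, 0, 0, 1, 1, 1, -1]
print(loose)  # [0, 0, 0, 0, 0, 0, 0]
```

The matrix is an input to the clustering step, so any algorithm that
works from pairwise distances could be given the same one, provided
the frames and the metric behind it have not changed.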

Hope this helps,

-Dan

-- 
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 307
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)

Received on Thu May 07 2015 - 09:30:07 PDT