
From: Daniel Roe <daniel.r.roe.gmail.com>

Date: Thu, 7 May 2015 10:12:11 -0600

Hi,

On Thu, May 7, 2015 at 3:29 AM, Juan Eiros Zamora

<j.eiros-zamora14.imperial.ac.uk> wrote:

> I have read the cluster documentation in the manual but I'm failing to
> understand how the pairwise distance matrix works for the clustering. In
> one of the examples, the clustering is done using the "distance" command
> between two residues, and then clustering based on that distance (if I'm
> not mistaken) as such:

There are two "distances" in the example. In the context of
clustering, "distances" are just the measures of similarity of one
data point to another. The distance (i.e. similarity) metric can be
either coordinate-based (e.g. RMSD, DME) or based on data derived from
coordinate frames (e.g. geometric distance, radius of gyration, etc.).

> Example: cluster on a specific distance:
> distance endToEnd :1 :255
> cluster data endToEnd clusters 10 epsilon 3.0 summary summary.dat info info.dat

In this case the clustering distance metric is 'data', which just
means "use the provided data set". So say in the given example we're
clustering on 4 frames. The data set 'endToEnd' will contain the
through-space distance between residues 1 and 255 for each frame,
which may look something like:

20.0
21.4
21.1
20.3

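Then, for the purposes of clustering, the distance between frames is
taken as the Euclidean distance between their data set values; for a
single one-dimensional data set this is just the absolute difference.
So the "distance" for clustering between frames 1 and 2 would be
|21.4 - 20.0| = 1.4, etc. If your metric were RMSD, then you would
instead be calculating the RMSD between the coordinates of frames 1
and 2.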
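
To make the arithmetic concrete, here is a rough sketch in plain Python
(not cpptraj's actual code) of how the pairwise clustering distances
follow from a one-dimensional data set. The list `end_to_end` holds the
four illustrative values above, and `pairwise_distances` is a
hypothetical helper name, not a cpptraj command.

```python
from itertools import combinations

# Illustrative per-frame end-to-end distances (the four example values above).
end_to_end = [20.0, 21.4, 21.1, 20.3]


def pairwise_distances(values):
    """Map each frame pair (i, j), 1-based, to |values[i] - values[j]|.

    For a single 1D data set the Euclidean distance between two frames
    reduces to the absolute difference of their values.
    """
    return {
        (i + 1, j + 1): abs(values[i] - values[j])
        for i, j in combinations(range(len(values)), 2)
    }


pair_dist = pairwise_distances(end_to_end)
for (i, j), d in sorted(pair_dist.items()):
    print(f"frames {i}-{j}: {d:.1f}")
```

With 4 frames there are 6 unique pairs, and the frames 1-2 entry is the
1.4 mentioned above.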

> If I want to cluster only a specific region of my system (residues 232
> to 248, for instance), I should not follow the example above right? So,
> the first time the command would be something like this:
>
> cluster dbscan minpoints 4 epsilon 3.5 rms savepairdist pardist matrixfile

Well, if you only want to cluster on RMSD of residues 232 to 248 you
need to put that atom mask in there (otherwise you will use all
atoms). Also 'pardist' should be 'pairdist', so:

cluster dbscan minpoints 4 epsilon 3.5 rms :232-248 savepairdist pairdist matrixfile

> And then for the subsequent clustering trials I should use
>
> cluster dbscan minpoints 4 epsilon X rms loadpairdist matrixfile #Change

The 'pairdist' keyword specifies the pairwise distance matrix file
name, so you want:

cluster dbscan minpoints 4 epsilon X rms :232-248 loadpairdist pairdist matrixfile

> Does the pairwise matrix that is saved vary based on the cluster
> algorithm that is used? If not, could I use the same pairwise distance
> matrix to try out different algorithms?

The pairwise distance matrix depends only on the clustering metric and
the input frames, not on the clustering algorithm. So as long as your
input frames and distance metric are the same, you can re-use the
pairwise distance matrix file. This is in fact why cpptraj does not
automatically save the pairwise distance matrix file: if you re-use it
and choose a different metric you will get incorrect results (cpptraj
does what checking it can to ensure the pairwise distance file is OK,
but doesn't catch everything).
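
As a toy illustration of why the matrix can be re-used, the Python
sketch below builds the pairwise matrix once and then repeats a
DBSCAN-style "core point" test at several epsilon values without
recomputing any distances. This is only a sketch under stated
assumptions, not how cpptraj is implemented: `build_matrix` and
`core_points` are hypothetical helper names, the data are the four
example values from earlier, and a frame is counted as its own
neighbor, which is one common convention and may differ from cpptraj's.

```python
def build_matrix(values):
    """Full symmetric matrix of absolute differences between frames."""
    n = len(values)
    return [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]


def core_points(matrix, epsilon, minpoints):
    """1-based frame numbers with >= minpoints frames within epsilon.

    Assumption: the frame itself (distance 0) counts toward minpoints.
    """
    core = []
    for i, row in enumerate(matrix):
        neighbors = sum(1 for d in row if d <= epsilon)
        if neighbors >= minpoints:
            core.append(i + 1)
    return core


values = [20.0, 21.4, 21.1, 20.3]
matrix = build_matrix(values)  # computed once, like a saved pairdist file

# Only epsilon changes between trials; the matrix is re-used as-is.
for eps in (0.5, 1.0, 1.5):
    print(f"epsilon {eps}: core frames {core_points(matrix, eps, minpoints=3)}")
```

If the metric or the input frames changed (a different mask, say),
`matrix` would have to be rebuilt, which is the same caveat as for the
saved pairwise distance file.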

Hope this helps,

-Dan

--
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 307
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu May 07 2015 - 09:30:07 PDT
