Re: [AMBER] Trouble understanding DBSCAN clustering algorithm from Daniel Roe on 2015-05-06 (Amber Archive May 2015)

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Wed, 6 May 2015 10:49:29 -0600

Hi,

On Wed, May 6, 2015 at 10:22 AM, Juan Eiros Zamora
<j.eiros-zamora14.imperial.ac.uk> wrote:
> From what I understand, now epsilon should be chosen as the Y value of the
> "K-dist" graph where the slope flattens out, and minpoints is the value of
> K?

Yes.
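
For reference, here is a minimal sketch of how a K-dist curve can be
computed from a pairwise distance matrix. It is an illustration in plain
NumPy, not cpptraj's implementation; the function name k_dist_curve and
the toy data are hypothetical, and it assumes a full symmetric matrix
with zeros on the diagonal.

```python
import numpy as np

def k_dist_curve(dist, k):
    """Return the distance from each point to its k-th nearest neighbor,
    sorted in descending order (the curve shown in a K-dist plot).

    Assumes `dist` is a full symmetric (n x n) matrix with a zero diagonal.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    if dist.ndim != 2 or dist.shape != (n, n):
        raise ValueError("dist must be a square matrix")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    # After sorting each row, column 0 is the point's distance to itself
    # (zero), so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)[::-1]

# Toy example: four points on a line, one of them far from the rest.
points = np.array([0.0, 1.0, 2.0, 10.0])
toy_dist = np.abs(points[:, None] - points[None, :])
print(k_dist_curve(toy_dist, 1))
```

Plotting the returned values against their index gives the K-dist
graph; epsilon is then read off as the Y value where the curve bends.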

> The dimensionality of an MD data set is 3 (three-dimensional space), so
> should K always be set to >= Dimensions + 1?
>
> Both the Amber manual and the original DBSCAN paper suggest K to be 4
> (although the original paper mentions that 4 is meant for 2-dimensional
> data); but from my graphs I see that changing the K value also makes the
> Epsilon value vary substantially (the bending point changes).

As with all clustering methods, choosing parameters for DBSCAN can be
a bit of an art form. If your pairwise distance matrix has well-defined
regions of high density, DBSCAN will work well. If it doesn't (e.g.
because the changes are very subtle, like the flip of a single side
chain out of 400+ residues), then it may not, because the density is
close to uniform. You can see in your case that as you increase
minpoints the resulting density profile flattens out, which indicates
that values of K past 10 may not be worth looking into. What I would
do in your case is save the pairwise distance matrix with
'savepairdist' so it can be reused, cluster at least 3-4 times using
different values of K and epsilon, then compare the results using
metrics like DBI, pseudo-F, silhouette, etc.
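
To make the comparison step concrete, here is a rough sketch of scoring
a clustering by its mean silhouette value directly from a saved
pairwise distance matrix. This is plain NumPy written for illustration,
not how cpptraj computes it; the function name mean_silhouette, the
noise label of -1, and the choice to leave noise points out of the
score are all assumptions.

```python
import numpy as np

def mean_silhouette(dist, labels, noise_label=-1):
    """Mean silhouette over all non-noise points.

    `dist` is a full symmetric distance matrix with a zero diagonal and
    `labels` holds one cluster label per point. Points labeled
    `noise_label` are dropped before scoring (an assumption of this sketch).
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    keep = labels != noise_label
    dist = dist[np.ix_(keep, keep)]
    labels = labels[keep]
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("need at least two clusters to compute a silhouette")
    scores = np.zeros(labels.size)
    for i in range(labels.size):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        # Mean distance to the other members of the same cluster
        # (dist[i, i] is zero, so dividing by n_same - 1 excludes self).
        a = dist[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Toy example: two tight pairs of points, labeled sensibly and badly.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
print(mean_silhouette(toy_dist, [0, 0, 1, 1]))
print(mean_silhouette(toy_dist, [0, 1, 0, 1]))
```

Running each (K, epsilon) combination through a score like this, with
the same distance matrix every time, gives one number per run to
compare; higher values indicate tighter, better separated clusters.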

> I also did a quick literature search on DBSCAN use in MD analysis, and I
> saw that in the following paper
> <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3893832/> the minpoints is set
> to be 25, but I can't find in the paper or its Supporting Information any
> "K-dist" plot. Does this mean that the 0.9 value for epsilon was taken from
> a Kdist.25 plot?

I believe Niel chose these values purely through trial and error
(Niel, correct me if I'm wrong). This was before we were generating
K-dist plots on a regular basis. The K-dist plots can only aid in
choosing values; there is still some trial and error involved.

You may also want to compare results from different clustering
algorithms (like kmeans, now available in cpptraj from AmberTools 15).

Hope this helps,

-Dan

>
> Any comments on this matter will be greatly appreciated.
>
>
> Best regards,
>
> Juan Eiros
>
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
>

-- 
-------------------------
Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 307
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)

Received on Wed May 06 2015 - 10:00:03 PDT