Re: [AMBER] Error with using dpeaks clustering in CPPTRAJ from Daniel Roe on 2020-01-17 (Amber Archive Jan 2020)

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Fri, 17 Jan 2020 10:35:57 -0500

Hi,

On Fri, Jan 17, 2020 at 10:09 AM Yinglong Miao <yinglong.miao.gmail.com> wrote:
> Lots of the snapshots were assigned to a cluster -1, which seems to include widely distributed ligand conformations in both the bound and unbound states. Where does this “-1” cluster come from? Do you have any thoughts on how to assign these snapshots more accurately to the correct clusters?

In cpptraj clustering, a cluster assignment of -1 is "noise" (i.e.
unassigned) - this is mentioned in the manual but perhaps needs a more
prominent entry. If you're getting a lot of things assigned to noise,
that means your data is very sparse and/or your clustering parameters
are too strict (for DBSCAN, epsilon too small or minpoints too large).
How did you decide on the parameters for 'dbscan'? Did
you see the manual entry "Hints for setting DBSCAN parameters with
’kdist’"?
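
For readers unfamiliar with the k-dist idea, here is a minimal Python sketch of the general heuristic from the DBSCAN literature: for each point, take the distance to its k-th nearest neighbor, sort those distances in decreasing order, and look for the "elbow" in the resulting curve as a candidate epsilon. This is not cpptraj's implementation; the function name `k_dist_curve` and the toy 2D points are assumptions made for illustration (in cpptraj the distance would typically be a pairwise metric such as RMSD between frames).

```python
import math

def k_dist_curve(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted descending."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    curve = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(dists[k - 1])
    return sorted(curve, reverse=True)

# Toy data (hypothetical): a tight group of four points plus one distant outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_dist_curve(points, k=2)
for rank, d in enumerate(curve, start=1):
    print(f"{rank:3d} {d:8.3f}")
```

In this toy case the curve drops sharply after the first entry: the outlier has a large k-dist, while the four grouped points share a small one. An epsilon chosen just above the flat part of the curve would keep the group together and leave the outlier as noise (-1).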

>
> Given what we have for dbscan, I wanted to try dpeaks, hoping it can resolve the issue. I will lower epsilon as you suggested and see if it can at least complete the calculation. From the initial outputs, it seems there is also a “-1” cluster, though. Since we have a very large number of simulation frames for clustering, the sieve option is essential to avoid memory problems. When might you be able to make that available for dpeaks?

Unfortunately I have a lot of things on my plate, so it's unlikely I
would get to this soon. My recommendation would be to try and get
DBSCAN working. Good general advice for clustering is to do your
initial clustering on a much smaller subset of the trajectory first
and then "tune" the parameters. So e.g. create a small trajectory that
contains only 1000 frames or so:

trajin mytraj.nc 1 last 250
trajout mytraj.1000.nc

Then do clustering on that. That will save you time while you figure
out what settings work for this system. Then you can gradually add
more frames and sieving etc. Clustering is really an art form, so be
prepared for a lot of fine tuning.
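
As a quick check on the subsampling arithmetic above, here is a small Python sketch, assuming the trajectory has 250000 frames (the count reported in the quoted cpptraj output further down) and that 'trajin <start> <stop> <offset>' reads frames start, start+offset, ... up to stop. The helper names `offset_for_target` and `frames_read` are hypothetical and only illustrate the calculation.

```python
def offset_for_target(total_frames, target_frames):
    """Frame offset (stride) that yields roughly target_frames frames."""
    return max(1, total_frames // target_frames)

def frames_read(start, stop, offset):
    """Number of frames read for 1-based, inclusive start/stop and a given offset."""
    return len(range(start, stop + 1, offset))

total = 250000                              # assumed trajectory length
offset = offset_for_target(total, 1000)     # stride for about 1000 frames
count = frames_read(1, total, offset)
print(f"trajin mytraj.nc 1 last {offset}  -> {count} frames")
```

With these assumptions the offset comes out to 250 and the subset contains 1000 frames, matching the 'trajin' line above. For a trajectory of a different length, the offset would need to be adjusted accordingly.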

Good luck!

-Dan

>
> Thanks again,
> Yinglong
>
>
> > On Jan 17, 2020, at 8:41 AM, Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >
> > PPS - Also note that adding back sieved frames isn't yet implemented
> > for 'dpeaks', so you may just want to use another clustering method.
> > If you want to stick with density based there's 'dbscan'...
> >
> > On Fri, Jan 17, 2020 at 9:37 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >>
> >> PS - If you're really interested, prior to the clustering command you
> >> can use 'debug <#>' (where <#> is greater than 0) to print more
> >> potentially helpful information. In the output you will see 'DBG: Max
> >> dist=' which will show the maximum distance observed between points;
> >> epsilon should be less than this. I should probably have that printed
> >> by default.
> >>
> >> Thanks for the report by the way.
> >>
> >> On Fri, Jan 17, 2020 at 9:34 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >>>
> >>> OK - I've been looking at this for a bit. I think that the problem
> >>> must be that all your points are too close, i.e. all points are within
> >>> epsilon from each other. Your dvdfile backs that up - the first column
> >>> is '#Density', which just means # of points that are within epsilon
> >>> from that point. In each case the #Density is 1249 (every other one
> >>> of the 1250 sieved frames), indicating that every point is within
> >>> epsilon of every other point. I think if you lower epsilon you'll
> >>> start to get better results.
> >>>
> >>> This is probably a case that cpptraj should trap. In my (limited)
> >>> defense, it does state that the 'dpeaks' implementation is under
> >>> development...
> >>>
> >>> So in summary, try lowering epsilon and see if that helps. I'll work
> >>> on an update to trap the case where epsilon is too large.
> >>>
> >>> Hope this helps,
> >>>
> >>> -Dan
> >>>
> >>> On Tue, Jan 14, 2020 at 12:34 PM <yinglong.miao.gmail.com> wrote:
> >>>>
> >>>> I have also tried the gauss option. It gave the following output:
> >>>> ACTION OUTPUT:
> >>>>
> >>>> ANALYSIS: Performing 1 analyses:
> >>>> 0: [cluster C0 dpeaks epsilon 4 dvdfile dvdfile choosepoints auto runavg
> >>>> runavg.dat deltafile delta.dat sieve 200 gauss]
> >>>> Starting clustering.
> >>>> Mask [*] corresponds to 15 atoms.
> >>>> Estimated pair-wise matrix memory usage: > 3.123 MB
> >>>> Pair-wise matrix set up with sieve, 250000 frames, 1250 sieved frames.
> >>>> Calculating pair-wise distances.
> >>>> 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
> >>>>
> >>>> No error message was given but also no further output ...
> >>>>
> >>>> Thanks,
> >>>> Yinglong
> >>>>
> >>>>
> >>>> On Tue, Jan 14, 2020 at 9:47 AM Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >>>>
> >>>>> Can you provide me (either in reply to this or off list) your entire
> >>>>> cpptraj output and the contents of dvdfile?
> >>>>>
> >>>>> This could happen with very sparse density I think, although it's
> >>>>> difficult to say without exactly replicating. You could potentially
> >>>>> try the 'gauss' keyword for Gaussian density instead of discrete
> >>>>> density.
> >>>>>
> >>>>> -Dan
> >>>>>
> >>>>> On Mon, Jan 13, 2020 at 8:10 PM Yinglong Miao <yinglong.miao.gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Dan,
> >>>>>>
> >>>>>> It’s the latest version from the AMBER git repository.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Yinglong
> >>>>>>
> >>>>>>
> >>>>>>> On Jan 13, 2020, at 6:20 PM, Daniel Roe <daniel.r.roe.gmail.com> wrote:
> >>>>>>>
> >>>>>>> What version of cpptraj are you using?
> >>>>>>>
> >>>>>>> -Dan
> >>>>>>>
> >>>>>>> On Mon, Jan 13, 2020 at 6:51 PM <yinglong.miao.gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I tried to use the dpeaks algorithm for clustering with the following
> >>>>>>>> command:
> >>>>>>>> cluster C0 dpeaks epsilon 4 dvdfile dvdfile choosepoints auto runavg
> >>>>>>>> runavg.dat deltafile delta.dat sieve 200
> >>>>>>>>
> >>>>>>>> But I keep getting the following output with an error:
> >>>>>>>> ...
> >>>>>>>> Finding closest neighbor point with higher density for each point.
> >>>>>>>> 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
> >>>>>>>> Internal Error: In Cluster_DPeaks::AssignClusterNum nearest neighbor is -1.
> >>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>
> >>>>>>>> I would appreciate any suggestions that would fix this ...
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Yinglong
> >>>>>>>>
> >>>>>>>> Yinglong Miao, Ph.D.
> >>>>>>>> Assistant Professor
> >>>>>>>> Center for Computational Biology and
> >>>>>>>> Department of Molecular Biosciences
> >>>>>>>> University of Kansas
> >>>>>>>> http://miao.compbio.ku.edu
> >>>>>>>> _______________________________________________
> >>>>>>>> AMBER mailing list
> >>>>>>>> AMBER.ambermd.org
> >>>>>>>> http://lists.ambermd.org/mailman/listinfo/amber

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Jan 17 2020 - 08:00:02 PST