[scikit-learn] Accessing Clustering Feature Tree in Birch
Roman Yurchak
rth.yurchak at gmail.com
Mon Oct 2 05:14:52 EDT 2017
Hello,
sklearn.cluster.Birch follows the original BIRCH paper, which appears to
be mostly focused on efficiently building the hierarchical clustering
tree (and not so much on making the later analysis user-friendly). The
attributes exposed by Birch are those that could reasonably be exposed
given the scikit-learn API constraints. That said, one does have access
to the full cluster hierarchy via Birch.root_.
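As a rough illustration of what is reachable from Birch.root_, the
sketch below walks the tree depth-first and labels every subcluster with
a tuple path (the index taken at each level), i.e. the "list of tuples"
representation mentioned in the quoted message. It relies on the private
attributes subclusters_ and child_ of the internal node objects; these
are not part of the public API and may change between scikit-learn
versions. The helper name iter_subclusters and the toy data are made up
for this example.

```python
import numpy as np
from sklearn.cluster import Birch


def iter_subclusters(node, path=()):
    """Yield (path, subcluster) pairs for everything below `node`.

    `path` is a tuple of child indices from the root down to the
    subcluster. NOTE: `subclusters_` and `child_` are private
    implementation details of scikit-learn's Birch (an assumption of
    this sketch, not a documented interface).
    """
    for i, sub in enumerate(node.subclusters_):
        sub_path = path + (i,)
        yield sub_path, sub
        # Subclusters stored in leaf nodes have no child node.
        if sub.child_ is not None:
            yield from iter_subclusters(sub.child_, sub_path)


# Toy data: 200 random points in the unit square.
rng = np.random.RandomState(0)
X = rng.rand(200, 2)

# n_clusters=None skips the final global clustering step.
birch = Birch(threshold=0.1, branching_factor=5, n_clusters=None).fit(X)

entries = list(iter_subclusters(birch.root_))
for path, sub in entries[:5]:
    print(path, sub.n_samples_)
```

Each subcluster only carries summary statistics (n_samples_,
linear_sum_, ...), not the indices of the samples it absorbed.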
As Joel said, traversing the tree is a standard CS problem, and there are
probably a number of operations that could be done with it, depending on
the application. For instance, for my use case, I found that
re-constructing the Birch hierarchy using a custom container class for
each subcluster made subsequent analysis easiest. A detailed example can
be found here:
http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html
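To give an idea of the container-class approach, here is a minimal
sketch. It is not the code from the linked example: the ClusterNode
class and the build_hierarchy helper are hypothetical names, and the
sketch again reads the private subclusters_, child_, linear_sum_ and
n_samples_ attributes, which may change between scikit-learn versions.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np
from sklearn.cluster import Birch


@dataclass(eq=False)
class ClusterNode:
    """Hypothetical container for one Birch subcluster."""
    centroid: np.ndarray
    n_samples: int
    children: List["ClusterNode"] = field(default_factory=list)

    def depth(self):
        """Number of levels in the subtree rooted at this node."""
        return 1 + max((c.depth() for c in self.children), default=0)


def build_hierarchy(node):
    """Copy the subclusters below a Birch node into ClusterNode objects.

    NOTE: relies on private attributes of scikit-learn's Birch nodes.
    """
    result = []
    for sub in node.subclusters_:
        children = build_hierarchy(sub.child_) if sub.child_ is not None else []
        # Centroid = linear sum of the absorbed samples / their count.
        centroid = np.asarray(sub.linear_sum_) / sub.n_samples_
        result.append(ClusterNode(centroid, sub.n_samples_, children))
    return result


rng = np.random.RandomState(0)
X = rng.rand(200, 2)
birch = Birch(threshold=0.1, branching_factor=5, n_clusters=None).fit(X)

top_level = build_hierarchy(birch.root_)
print(len(top_level), "top-level subclusters")
print("tree depth:", max(n.depth() for n in top_level))
```

Once the hierarchy lives in plain Python objects like these, further
analysis no longer depends on the Birch internals.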
Alternatively, I wonder if converting the tree to a format readable by
some tree/graph specialized library (e.g. networkx) could be useful for
analysis.
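A possible shape for such a conversion, as an untested idea rather than
an existing feature: the hypothetical helper birch_to_digraph below
copies the hierarchy into a networkx DiGraph, with one graph node per
subcluster and an artificial "root" node on top. It assumes networkx is
installed and, as before, reads the private subclusters_ and child_
attributes.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import Birch


def birch_to_digraph(birch):
    """Copy the Birch subcluster hierarchy into a networkx DiGraph.

    Hypothetical helper; relies on private Birch attributes.
    """
    graph = nx.DiGraph()
    graph.add_node("root")
    stack = [("root", birch.root_)]
    next_id = 0
    while stack:
        parent_id, node = stack.pop()
        for sub in node.subclusters_:
            node_id = next_id
            next_id += 1
            # Keep the sample count as a node attribute.
            graph.add_node(node_id, n_samples=sub.n_samples_)
            graph.add_edge(parent_id, node_id)
            if sub.child_ is not None:
                stack.append((node_id, sub.child_))
    return graph


rng = np.random.RandomState(0)
X = rng.rand(200, 2)
birch = Birch(threshold=0.1, branching_factor=5, n_clusters=None).fit(X)

graph = birch_to_digraph(birch)
leaves = [n for n in graph.nodes if graph.out_degree(n) == 0]
print(graph.number_of_nodes(), "nodes,", len(leaves), "leaf subclusters")
```

With the tree in this form, generic graph routines such as
nx.descendants(graph, node_id) can answer "what is under this node"
questions.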
Generally there are a number of places in scikit-learn where trees are
used (Birch, AgglomerativeClustering, tree-based classifiers, etc.) but
for now there is no way to export the constructed tree to some standard
format (apart from sklearn.tree.export_graphviz). Not sure if this is
realistically achievable though.
--
Roman
On 20/09/17 13:40, Sema Atasever wrote:
> I need this information to use it in a scientific study and
> I think that a function interface would make this easier.
>
> Thank you for your answer.
>
> On Sat, Sep 16, 2017 at 1:53 PM, Joel Nothman
> <joel.nothman at gmail.com> wrote:
>
> There is no such thing as "the data samples in this cluster". The
> point of Birch being online is that it loses any reference to the
> individual samples that contributed to each node, but stores some
> statistics on their basis. Roman Yurchak has, however, offered a PR
> where, for the non-online case, storage of the indices contributing
> to each node can be optionally turned on:
> https://github.com/scikit-learn/scikit-learn/pull/8808
>
> As for finding what is contained under any particular node,
> traversing the tree is a fairly basic task from a computer science
> perspective. Before we were to support something to make this much
> easier, I think we'd need to be clear on what kinds of use case we
> were supporting. What do you hope to do with this information, and
> what would a function interface look like that would make this much
> easier?
>
> Decimals aren't a practical option as the branching factor may be
> greater than 10, it is a hard structure to inspect, and susceptible
> to computational imprecision. Better off with a list of tuples, but
> what for that is not easy enough to do now?
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn