[scikit-learn] Accessing Clustering Feature Tree in Birch

Mon Oct 2 05:14:52 EDT 2017

Hello,

sklearn.cluster.Birch follows the original BIRCH paper, which appears to 
be mostly focused on efficiently building the hierarchical clustering 
tree (and not so much on making the later analysis user-friendly). The 
attributes exposed by Birch are those that could reasonably be exposed 
given the scikit-learn API constraints. One does, however, have access to 
the full cluster hierarchy via Birch.root_.
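
To make this concrete, here is a minimal sketch of walking the hierarchy 
starting from Birch.root_. It relies on the attributes of the private 
node and subcluster classes (subclusters_, child_, n_samples_, centroid_), 
which are implementation details and may change between scikit-learn 
versions. The helper name iter_subclusters and the toy data are 
illustrative assumptions, not part of the scikit-learn API.

```python
import numpy as np
from sklearn.cluster import Birch


def iter_subclusters(node, depth=0):
    """Yield (depth, subcluster) pairs in depth-first order.

    Hypothetical helper: it only reads the subclusters_ and child_
    attributes of the fitted tree.
    """
    for subcluster in node.subclusters_:
        yield depth, subcluster
        # child_ is None for subclusters that live in a leaf node
        if subcluster.child_ is not None:
            yield from iter_subclusters(subcluster.child_, depth + 1)


# Toy data (assumption): two well-separated 2D blobs
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])

# n_clusters=None skips the final global clustering step
birch = Birch(threshold=0.5, branching_factor=5, n_clusters=None).fit(X)

pairs = list(iter_subclusters(birch.root_))
leaves = [sc for _, sc in pairs if sc.child_ is None]

# Print an indented view of the hierarchy
for depth, sc in pairs:
    centroid = np.round(sc.centroid_, 2)
    print("  " * depth + f"n_samples={sc.n_samples_} centroid={centroid}")
```

Each sample ends up in exactly one leaf subcluster, so the leaf sample 
counts should add up to the number of rows in X.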

As Joel said, traversing the tree is a standard CS problem, and there 
are probably a number of operations that could be done with it, 
depending on the application. For instance, for my use case, I found 
that reconstructing the Birch hierarchy using a custom container class 
for each subcluster made subsequent analysis easiest. A detailed 
example can be found here:
http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html
Alternatively, I wonder if converting the tree to a format readable by 
a specialized tree/graph library (e.g. networkx) could be useful for 
analysis.
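
As a rough sketch of the conversion idea, the tree can be flattened 
into integer node ids plus a list of (parent, child) pairs, which is a 
shape that graph libraries generally accept (networkx, for example, can 
build a DiGraph from a list of pairs). The function name tree_to_edges 
and the stand-in tree below are hypothetical. The stand-in only mimics 
the subclusters_ / child_ / n_samples_ layout so that the example runs 
without fitting a model. In practice one would pass Birch.root_ instead, 
with the caveat that those attributes are implementation details.

```python
from itertools import count
from types import SimpleNamespace


def tree_to_edges(root):
    """Flatten a CF-tree-like structure into (nodes, edges).

    nodes maps an integer id to the subcluster's sample count
    (None for the synthetic root, id 0); edges is a list of
    (parent_id, child_id) pairs in depth-first order.
    """
    ids = count(1)
    nodes, edges = {0: None}, []

    def visit(node, parent_id):
        for sc in node.subclusters_:
            sc_id = next(ids)
            nodes[sc_id] = sc.n_samples_
            edges.append((parent_id, sc_id))
            if sc.child_ is not None:
                visit(sc.child_, sc_id)

    visit(root, 0)
    return nodes, edges


# Stand-in objects (assumption) with the same attribute names as the
# fitted Birch tree; not scikit-learn classes.
def make_subcluster(n_samples, child=None):
    return SimpleNamespace(n_samples_=n_samples, child_=child)


def make_node(*subclusters):
    return SimpleNamespace(subclusters_=list(subclusters))


toy_root = make_node(
    make_subcluster(5, make_node(make_subcluster(2), make_subcluster(3))),
    make_subcluster(4, make_node(make_subcluster(4))),
)

nodes, edges = tree_to_edges(toy_root)
print(edges)
```

Keeping the output as plain ids and pairs avoids tying the export to 
any one graph library.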

Generally, there are a number of places in scikit-learn where trees are 
used (Birch, AgglomerativeClustering, tree-based classifiers, etc.), but 
for now there is no way to export the constructed tree to a standard 
format (apart from sklearn.tree.export_graphviz). Not sure if this is 
realistically achievable, though.

-- 
Roman

On 20/09/17 13:40, Sema Atasever wrote:
> I need this information to use it in a scientific study and
> I think that a function interface would make this easier.
>
> Thank you for your answer.
>
> On Sat, Sep 16, 2017 at 1:53 PM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>     There is no such thing as "the data samples in this cluster". The
>     point of Birch being online is that it loses any reference to the
>     individual samples that contributed to each node, but stores some
>     statistics on their basis. Roman Yurchak has, however, offered a PR
>     where, for the non-online case, storage of the indices contributing
>     to each node can be optionally turned on:
>     https://github.com/scikit-learn/scikit-learn/pull/8808
>
>     As for finding what is contained under any particular node,
>     traversing the tree is a fairly basic task from a computer science
>     perspective. Before we were to support something to make this much
>     easier, I think we'd need to be clear on what kinds of use case we
>     were supporting. What do you hope to do with this information, and
>     what would a function interface look like that would make this much
>     easier?
>
>     Decimals aren't a practical option as the branching factor may be
>     greater than 10, it is a hard structure to inspect, and susceptible
>     to computational imprecision. Better off with a list of tuples, but
>     what for that is not easy enough to do now?
>
>
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org <mailto:scikit-learn at python.org>
>     https://mail.python.org/mailman/listinfo/scikit-learn
>