[scikit-learn] clustering on big dataset

Thu Jan 4 06:55:49 EST 2018

Could you use nearest neighbors with a KD tree to build a sparse distance
matrix, in which the distance from a point to anything other than its
nearest neighbors is treated as (near-)infinite? Yes, this again adds a
parameter (the neighborhood size), just as BIRCH has its threshold. I
suspect you cannot avoid having some approximation parameter of this kind.
You do not need to set n_clusters to a fixed value for BIRCH. You only need
to provide another clusterer, which has its own parameters, although you
should be able to experiment with different "global clusterers".
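
To make both suggestions concrete, here is a rough sketch rather than a
recommended recipe. It assumes a small synthetic dataset, and the values
n_neighbors=10 and threshold=0.5 are arbitrary choices for illustration.
The variable names are mine. The first part builds a sparse k-nearest-
neighbor distance graph with a KD tree and passes it as the connectivity
constraint of AgglomerativeClustering (which uses only the sparsity
pattern, not the stored distances). The second part runs BIRCH once with
n_clusters=None and once with another clusterer as the global step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the real data: two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(200, 2)),
])

# 1) Sparse neighbor graph built with a KD tree.
#    n_neighbors=10 is an assumed neighborhood size, to be tuned.
nn = NearestNeighbors(n_neighbors=10, algorithm="kd_tree").fit(X)
# Calling kneighbors_graph() without X excludes each point from its own
# neighbor list; only n_samples * n_neighbors distances are stored.
graph = nn.kneighbors_graph(mode="distance")

# Merges are restricted to pairs linked in the graph, so the full dense
# distance matrix is never formed. scikit-learn may warn that the graph
# has several connected components and complete it automatically.
agg = AgglomerativeClustering(n_clusters=2, connectivity=graph, linkage="ward")
agg_labels = agg.fit_predict(X)

# 2) BIRCH without a fixed number of clusters: n_clusters=None skips the
#    global step and returns the subclusters themselves.
birch_free = Birch(threshold=0.5, n_clusters=None).fit(X)
print("subclusters found:", birch_free.subcluster_centers_.shape[0])

# 3) BIRCH with another clusterer as the "global clusterer", applied to
#    the subcluster centers only.
global_step = AgglomerativeClustering(n_clusters=2)
birch_global = Birch(threshold=0.5, n_clusters=global_step).fit(X)
print("final clusters:", len(set(birch_global.labels_)))
```

In both cases an approximation parameter remains (neighborhood size or
threshold), which is the trade-off described above.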

On 4 January 2018 at 11:04, Shiheng Duan <shiduan at ucdavis.edu> wrote:

> Yes, it is an efficient method, but we still need to specify the number of
> clusters or the threshold. Is there another way to run hierarchical
> clustering on a big dataset? The main problem is the distance matrix.
> Thanks.
>
> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel <olivier.grisel at ensta.org>
> wrote:
>
>> Have you had a look at BIRCH?
>>
>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>
>> --
>> Olivier
>> 
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>