[scikit-learn] DBScan freezes my computer !!!

Joel Nothman joel.nothman at gmail.com
Sun May 13 23:07:21 EDT 2018


Note that this has long been documented under "Memory consumption for large
sample sizes" at
http://scikit-learn.org/stable/modules/clustering.html#dbscan

On 14 May 2018 at 12:59, Joel Nothman <joel.nothman at gmail.com> wrote:

> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from computing the radius neighborhoods of all
> points up front. If the distance metric cannot be indexed with a KD-tree
> or ball tree, the full pairwise distance matrix (n^2 floats) is held in
> memory even before the radius neighbors are extracted from it.
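
To put rough numbers on the n^2 cost mentioned above, here is a
back-of-the-envelope sketch. It assumes float64 distances (8 bytes each);
the helper name dense_pairwise_bytes is made up for illustration and is
not part of scikit-learn.

```python
def dense_pairwise_bytes(n_samples, bytes_per_float=8):
    """Bytes needed for a dense n x n distance matrix (float64 assumed)."""
    return n_samples * n_samples * bytes_per_float


# Print the dense-matrix footprint for a few dataset sizes.
for n in (10_000, 100_000, 1_000_000):
    gib = dense_pairwise_bytes(n) / 2**30
    print(f"{n:>9,} samples -> {gib:,.1f} GiB")
```

Under that assumption, 100,000 samples already need about 80 GB for the
dense matrix, which is consistent with a machine appearing to freeze.
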
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once. This
> produces a sparse graph representation, which can be passed into dbscan
> with metric='precomputed'. (I've just seen Sebastian suggested the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points would
> be merged but would have a sample_weight of 2).
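
Strategy 1 could look roughly like the sketch below. It is only one way to
do the chunking: it assumes the Euclidean metric, uses the
radius_neighbors_graph method of a fitted NearestNeighbors object to query
one block of rows at a time, and stacks the sparse blocks. The function
name chunked_radius_graph, the chunk size and the toy data are
illustrative choices, not anything from the library or this thread.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def chunked_radius_graph(X, eps, chunk_size=1000):
    """Sparse CSR graph of pairwise distances <= eps, built block by block."""
    nn = NearestNeighbors(radius=eps).fit(X)
    blocks = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Only the neighbors within eps of this block of rows are stored.
        blocks.append(nn.radius_neighbors_graph(chunk, mode="distance"))
    return sparse.vstack(blocks, format="csr")


# Toy data: two well-separated blobs (hypothetical example values).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

graph = chunked_radius_graph(X, eps=0.5, chunk_size=30)
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(graph)
```

Peak memory is then governed by the chunk size and by how many neighbors
fall within eps, rather than by n^2. If eps is so large that most pairs are
neighbors, the sparse graph itself becomes dense and this does not help.
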
>
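
Strategy 2 can be sketched as follows, for exact duplicates only. It
assumes duplicates can be found with np.unique on the rows; merging
near-duplicates would need an extra rounding or binning step, which is not
shown and would change the result slightly. The function name
dbscan_on_unique and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_on_unique(X, eps, min_samples):
    """Cluster the unique rows of X, weighting each by its multiplicity."""
    unique_rows, inverse, counts = np.unique(
        X, axis=0, return_inverse=True, return_counts=True
    )
    inverse = np.ravel(inverse)  # guard against NumPy versions returning 2-D
    model = DBSCAN(eps=eps, min_samples=min_samples)
    model.fit(unique_rows, sample_weight=counts)
    # Map the labels of the unique rows back onto the original rows.
    return model.labels_[inverse]


# Five distinct points, some repeated several times (hypothetical values).
base = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 20.0]])
X = np.repeat(base, [3, 2, 4, 1, 1], axis=0)

labels = dbscan_on_unique(X, eps=0.5, min_samples=4)
```

Here DBSCAN only sees 5 rows instead of 11, yet the repeated points still
count towards min_samples through their weights.
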
> There is also a proposal to offer an alternative memory-efficient mode at
> https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel