[scikit-learn] DBScan freezes my computer !!!

Andreas Mueller t3kcit at gmail.com
Wed May 16 13:44:17 EDT 2018


Should we have "low memory"/batched version of k_neighbors_graph and 
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.


On 05/13/2018 10:59 PM, Joel Nothman wrote:
> This is quite a common issue with our implementation of DBSCAN, and 
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius 
> neighbors for all points. If using a distance metric that cannot be 
> indexed with a KD-tree or Ball Tree, this results in n^2 floats being 
> stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in 
> chunks, so as to avoid all pairs being calculated and stored at once. 
> This produces a sparse graph representation, which can be passed into 
> dbscan with metric='precomputed'. (I've just seen Sebastian suggested 
> the same.)
> 2. Reduce the number of samples in your dataset and represent 
> (near-)duplicate points with sample_weight (i.e. two identical points 
> would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode 
> at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is 
> welcome.
>
> Cheers,
>
> Joel
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180516/f9ad143a/attachment-0001.html>


More information about the scikit-learn mailing list