[scikit-learn] DBScan freezes my computer !!!
Joel Nothman
joel.nothman at gmail.com
Sun May 13 22:59:15 EDT 2018
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius neighbors
for all points. If using a distance metric that cannot be indexed with a
KD-tree or Ball Tree, this results in n^2 floats being stored in memory
even before the radius neighbors are computed.
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once. This
produces a sparse graph representation, which can be passed into dbscan
with metric='precomputed'. (I've just seen Sebastian suggested the same.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points would
be merged but would have a sample_weight of 2).
There is also a proposal to offer an alternative memory-efficient mode at
https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is welcome.
Cheers,
Joel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180514/9ce620a0/attachment-0001.html>
More information about the scikit-learn
mailing list