[scikit-learn] Can I evaluate clustering efficiency incrementally?
joel.nothman at gmail.com
Thu May 16 03:06:37 EDT 2019
The contingency matrix counts how many times each pair of (true cluster,
predicted cluster) occurs. It is a sufficient statistic for every
"supervised" (i.e. ground truth-based) clustering evaluation metric in
scikit-learn. In an incremental setting, you can simply add the counts from
each new predicted batch to the contingency matrix. In
https://github.com/scikit-learn/scikit-learn/issues/8103 I proposed that we
provide an API for calculating clustering metrics from the sufficient
statistics alone, but it's not come to fruition.
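
As a rough sketch of the idea (not an existing scikit-learn API): the code
below accumulates a contingency matrix batch by batch with NumPy and then
computes the adjusted Rand index from the matrix alone. The function names
update_contingency and adjusted_rand_from_contingency are made up for
illustration, and the sketch assumes that labels are integers in 0..k-1 and
that the number of true classes and predicted clusters is known up front.

```python
import numpy as np


def update_contingency(C, y_true, y_pred):
    """Add one batch of (true label, predicted label) pairs to C in place.

    Hypothetical helper: assumes integer labels that index rows/columns of C.
    """
    # np.add.at handles repeated (row, col) pairs correctly, unlike C[r, c] += 1.
    np.add.at(C, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return C


def adjusted_rand_from_contingency(C):
    """Adjusted Rand index computed from the contingency matrix only."""
    C = np.asarray(C, dtype=np.int64)
    n = C.sum()
    if n < 2:
        return 1.0  # no pairs of samples to compare

    def comb2(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) / 2.0

    sum_cells = comb2(C).sum()             # pairs grouped together in both labelings
    sum_rows = comb2(C.sum(axis=1)).sum()  # pairs grouped together in the true labels
    sum_cols = comb2(C.sum(axis=0)).sum()  # pairs grouped together in the predictions
    expected = sum_rows * sum_cols / comb2(n)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Assumed to be known in advance for this sketch.
n_classes, n_clusters = 3, 3
C = np.zeros((n_classes, n_clusters), dtype=np.int64)

# Toy batches of (true labels, predicted labels), made up for illustration.
batches = [
    ([0, 0, 1, 1], [1, 1, 0, 0]),
    ([2, 2, 0, 1], [2, 2, 1, 2]),
]
for y_true, y_pred in batches:
    update_contingency(C, y_true, y_pred)

ari = adjusted_rand_from_contingency(C)
print(C)
print(ari)
```

Only the small matrix C has to be kept in memory between batches, so the
cost of evaluation does not grow with the number of samples seen so far.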
On Thu, 16 May 2019 at 11:47, lampahome <pahome.chen at mirlab.org> wrote:
> Joel Nothman <joel.nothman at gmail.com> wrote on Wed, 15 May 2019 at 12:16 PM:
>> Evaluating on large datasets is easy if the sufficient statistics are
>> just the contingency matrix.
> Sorry, I don't understand. Can you explain in detail?
> Do you mean we could evaluate on a subset of the samples if the subset is
> a contingency (normal distribution) matrix?