[scikit-learn] imbalanced datasets return uncalibrated predictions - why?
Roman Yurchak
rth.yurchak at gmail.com
Tue Nov 17 04:54:33 EST 2020
On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
> And I understand that it has to do with the cost function, because if we
> re-balance the dataset with, say, class_weight = 'balanced', then the
> probabilities seem to be calibrated as a result.
As far as I know, logistic regression will have well-calibrated
probabilities even in the imbalanced case. However, with the default
decision threshold of 0.5, some of the infrequent categories may never
be predicted, since their predicted probability is too low.
If you use class_weight = 'balanced', the probabilities will no longer
be well calibrated; however, you would then predict some of those
infrequent categories.
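A minimal sketch of this trade-off (assuming a synthetic imbalanced
dataset built with make_classification; the sizes and weights below are
illustrative, not from the original discussion):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positive class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X, y)

# Unweighted model: mean predicted probability of the positive class
# roughly matches its true prevalence (calibrated on average), but the
# 0.5 threshold means the minority class is rarely predicted.
print(y.mean())
print(plain.predict_proba(X)[:, 1].mean())
print(plain.predict(X).mean())

# Weighted model: the minority class is predicted much more often, but
# its probabilities are inflated relative to the true prevalence.
print(balanced.predict_proba(X)[:, 1].mean())
print(balanced.predict(X).mean())
```

With class_weight='balanced' the loss is re-scaled, so the predicted
probabilities shift upward for the minority class even though the data
itself is unchanged.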
See discussions in
https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.
--
Roman