[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

Roman Yurchak rth.yurchak at gmail.com
Tue Nov 17 04:54:33 EST 2020


On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
> And I understand that it has to do with the cost function, because if we 
> re-balance the dataset with say class_weight = 'balance'. then the 
> probabilities seem to be calibrated as a result.

As far as I know, logistic regression will have well calibrated 
probabilities even in the imbalanced case. However, with the default 
decision threshold at 0.5, some of the infrequent categories may never 
be predicted, since their predicted probability never exceeds 0.5.

If you use class_weight = 'balanced', the probabilities will no longer 
be well calibrated; however, you will predict some of those infrequent 
categories.
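A minimal sketch of both effects on a synthetic imbalanced dataset (the
dataset parameters are illustrative, not from the original discussion): the
plain model's mean predicted probability tracks the true prevalence, while
the class_weight='balanced' model inflates it but predicts far more of the
minority class at the default 0.5 threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~2% positive class; parameters chosen only for illustration
X, y = make_classification(n_samples=20000, weights=[0.98, 0.02],
                           class_sep=0.5, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

# Calibration check: the plain model's mean predicted probability
# stays close to the true prevalence; the balanced model's does not.
print("prevalence:        ", y.mean())
print("plain mean proba:  ", plain.predict_proba(X)[:, 1].mean())
print("balanced mean proba:", balanced.predict_proba(X)[:, 1].mean())

# Default 0.5 threshold: the plain model rarely predicts the minority
# class; the balanced model predicts it much more often.
print("plain positives:   ", plain.predict(X).sum())
print("balanced positives:", balanced.predict(X).sum())
```

Note that instead of reweighting, one can also keep the plain (calibrated)
model and simply lower the decision threshold on predict_proba.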

See discussions in 
https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.

-- 
Roman
