[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

Sole Galli solegalli at protonmail.com
Tue Nov 17 03:57:13 EST 2020


Hello team,

I am trying to understand why logistic regression returns uncalibrated probabilities, with values that tend to be too low for the positive (rare) class, when trained on an imbalanced dataset.

I've read a number of articles, and all of them seem to agree that this is the case; many show empirical evidence, but none gives a mathematical demonstration. When I test it myself, I can see that this is indeed the case: logistic regression on imbalanced datasets returns uncalibrated probabilities.
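For reference, this is roughly the kind of check I ran (the synthetic dataset, sizes and parameter values below are just illustrative, not my real data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

# Highly imbalanced synthetic data: roughly 2% positives.
X, y = make_classification(n_samples=50000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = logit.predict_proba(X_test)[:, 1]

# Per bin: observed fraction of positives vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
print(np.c_[mean_pred, frac_pos])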

And I understand that it has to do with the cost function, because if we re-balance the dataset with, say, class_weight='balanced', then the probabilities seem to be calibrated as a result.
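Continuing from the snippet above, this is the re-weighted variant whose calibration curve I compared against (again just an illustrative sketch, not a rigorous benchmark):

# Same data as above, but with the cost function re-weighted per class.
logit_bal = LogisticRegression(max_iter=1000,
                               class_weight='balanced').fit(X_train, y_train)
proba_bal = logit_bal.predict_proba(X_test)[:, 1]

frac_pos_bal, mean_pred_bal = calibration_curve(y_test, proba_bal, n_bins=10)
print(np.c_[mean_pred_bal, frac_pos_bal])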

I was wondering if any of you knows of a mathematical demonstration that supports this conclusion? Any derivation, or clear explanation, of why logistic regression would return uncalibrated probabilities when trained on an imbalanced dataset would help.

Any link to a relevant article, video, presentation, etc. will be greatly appreciated.

Thanks a lot!

Sole