[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

Sean Violante sean.violante at gmail.com
Tue Nov 17 04:17:42 EST 2020

I am not sure you are using "calibrated" in the correct sense.
Calibrated means that the predicted probabilities match real-world
frequencies, so if you have a rare class the model should indeed
return low probabilities for it.
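A quick numerical sketch of that point (pure Python, no scikit-learn; the synthetic data and learning rate are invented for illustration): at the optimum of the unweighted log loss, the score equation for the intercept forces the sum of predicted probabilities to equal the number of positives, so the mean predicted probability matches the positive-class rate by construction.

```python
# Fit a one-feature logistic regression by full-batch gradient descent on an
# imbalanced synthetic dataset (~10% positives) and check that the average
# predicted probability matches the positive-class rate.
import math
import random

random.seed(0)

# ~10% positives; positives tend to have larger x.
X, y = [], []
for _ in range(1000):
    label = 1 if random.random() < 0.10 else 0
    X.append(random.gauss(1.0 if label else 0.0, 1.0))
    y.append(label)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient descent on the plain (unweighted) log loss.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    gw = gb = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)
        gw += (p - yi) * xi
        gb += (p - yi)
    w -= lr * gw / len(X)
    b -= lr * gb / len(X)

mean_p = sum(sigmoid(w * xi + b) for xi in X) / len(X)
base_rate = sum(y) / len(y)
print(round(mean_p, 3), round(base_rate, 3))
# At the optimum the intercept's gradient is sum(p_i - y_i) = 0, i.e.
# sum(p_i) = sum(y_i), so mean_p is close to base_rate: a low average
# probability on a rare class is calibrated behaviour, not a defect.
```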

On Tue, Nov 17, 2020 at 9:58 AM Sole Galli via scikit-learn <
scikit-learn at python.org> wrote:

> Hello team,
> I am trying to understand why logistic regression returns uncalibrated
> probabilities, tending toward low values for the positive (rare) class,
> when trained on an imbalanced dataset.
> I've read a number of articles; all seem to agree that this is the case,
> and many show empirical evidence, but none a mathematical demonstration.
> When I test it myself, I see the same thing: logistic regression on
> imbalanced datasets returns uncalibrated probabilities.
> And I understand that it has to do with the cost function, because if we
> re-balance the dataset with, say, class_weight='balanced', then the
> probabilities seem to be calibrated as a result.
> I was wondering if any of you knows a mathematical demonstration that
> supports this conclusion? Any proof, or a clear explanation of why logistic
> regression would return uncalibrated probabilities when trained on an
> imbalanced dataset?
> Any link to a relevant article, video, presentation, etc, will be greatly
> appreciated.
> Thanks a lot!
> Sole
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
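Regarding class_weight='balanced' in the message above: re-weighting changes the loss, so the fitted model satisfies a *weighted* score equation rather than matching the empirical base rate, which pushes the average predicted probability up toward the re-balanced prior. A rough sketch (pure Python; the synthetic data mirror the example above and are invented for illustration, with weights following scikit-learn's 'balanced' formula n_samples / (n_classes * n_class_samples)):

```python
# Logistic regression with class weights mimicking class_weight='balanced':
# rare positives count for more in the loss, so the fitted average
# probability no longer matches the raw ~10% positive rate.
import math
import random

random.seed(0)

X, y = [], []
for _ in range(1000):
    label = 1 if random.random() < 0.10 else 0
    X.append(random.gauss(1.0 if label else 0.0, 1.0))
    y.append(label)

n_pos = sum(y)
n_neg = len(y) - n_pos
# 'balanced' weights: n_samples / (n_classes * n_class_samples).
c = {1: len(y) / (2.0 * n_pos), 0: len(y) / (2.0 * n_neg)}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient descent on the class-weighted log loss.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(3000):
    gw = gb = 0.0
    for xi, yi in zip(X, y):
        g = c[yi] * (sigmoid(w * xi + b) - yi)
        gw += g * xi
        gb += g
    w -= lr * gw / len(X)
    b -= lr * gb / len(X)

mean_p = sum(sigmoid(w * xi + b) for xi in X) / len(X)
print(round(mean_p, 3), round(n_pos / len(y), 3))
# mean_p now sits well above the positive rate: the re-weighted model is
# calibrated against the artificial 50/50 prior, not the real frequencies.
```

So re-balancing does not make the probabilities better calibrated against the data as observed; it calibrates them against an artificial 50/50 world, which is why the unweighted low probabilities on the rare class were the calibrated ones.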
