[scikit-learn] imbalanced datasets return uncalibrated predictions - why?
sean.violante at gmail.com
Tue Nov 17 04:17:42 EST 2020
I am not sure you are using "calibrated" in the correct sense.
Calibrated means that the predicted probabilities align with real-world
frequencies, so if you have a rare class it should receive low probabilities.
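A minimal sketch of this point (the dataset is synthetic and the 95/5 class split is an assumption for illustration): on an imbalanced dataset a plain logistic regression, being a maximum-likelihood model with an intercept, produces a mean predicted probability close to the observed positive rate, i.e. low probabilities for the rare class, which is what calibration requires. Reweighting with class_weight='balanced' shifts the probabilities upward and away from the observed rate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced dataset: roughly 5% positives (illustrative choice).
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)

# Plain logistic regression: the score equations force the mean predicted
# probability to match the observed positive rate on the training data.
clf = LogisticRegression(max_iter=1000).fit(X, y)
p = clf.predict_proba(X)[:, 1]
print(f"observed positive rate: {y.mean():.3f}")
print(f"mean predicted prob:    {p.mean():.3f}")

# class_weight='balanced' reweights the loss, so the probabilities are
# inflated toward the rare class and no longer match the observed rate.
clf_bal = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
p_bal = clf_bal.predict_proba(X)[:, 1]
print(f"mean prob (balanced):   {p_bal.mean():.3f}")
```

Note this checks calibration only in the mean; a full check would bin predictions (e.g. with sklearn.calibration.calibration_curve) and compare each bin's mean prediction to its observed frequency.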
On Tue, Nov 17, 2020 at 9:58 AM Sole Galli via scikit-learn <
scikit-learn at python.org> wrote:
> Hello team,
> I am trying to understand why logistic regression returns uncalibrated
> probabilities, tending toward low values for the positive (rare) class,
> when trained on an imbalanced dataset.
> I've read a number of articles, and all seem to agree that this is the
> case; many show empirical evidence, but none a mathematical demonstration.
> When I test it myself, I can see that this is indeed the case: logit on
> imbalanced datasets returns uncalibrated probabilities.
> And I understand that it has to do with the cost function, because if we
> re-balance the dataset with, say, class_weight='balanced', then the
> probabilities seem to be calibrated as a result.
> I was wondering if any of you knows a mathematical demonstration that
> supports this conclusion? Any mathematical demonstration, or clear
> explanation of why logit would return uncalibrated probabilities when
> trained on an imbalanced dataset?
> Any link to a relevant article, video, presentation, etc. will be greatly
> appreciated.
> Thanks a lot!