[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

Thu Nov 19 02:55:42 EST 2020

Thank you guys, that was actually very helpful.

Best regards
Sole

Soledad Galli
https://www.trainindata.com/

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Tuesday, November 17th, 2020 at 10:54 AM, Roman Yurchak <rth.yurchak at gmail.com> wrote:

> On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
>
> > And I understand that it has to do with the cost function, because if we
> > re-balance the dataset with, say, class_weight = 'balanced', then the
> > probabilities seem to be calibrated as a result.
>
> As far as I know, logistic regression will have well calibrated
> probabilities even in the imbalanced case. However, with the default
> decision threshold at 0.5, some of the infrequent categories may never
> be predicted since their probability is too low.
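
A minimal sketch of the point above, assuming a synthetic binary dataset with about 5% positives (the dataset, the sample size and the 0.2 threshold are illustrative choices, not taken from this thread). It fits an unweighted LogisticRegression, compares the mean predicted probability with the observed class frequency on the training data, and counts how many samples cross the default 0.5 cut-off versus a lower one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data: roughly 5% positives (illustrative assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Unweighted logistic regression, default settings apart from max_iter.
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # predicted probability of the rare class

# On the training data the mean predicted probability should sit very
# close to the observed frequency of the rare class.
base_rate = y.mean()
mean_proba = proba.mean()
print(f"base rate: {base_rate:.3f}  mean predicted probability: {mean_proba:.3f}")

# Few samples cross 0.5; a lower (arbitrary) threshold flags at least as many.
n_pos_default = int((proba >= 0.5).sum())
n_pos_lowered = int((proba >= 0.2).sum())
print(f"predicted positives at 0.5: {n_pos_default}, at 0.2: {n_pos_lowered}")
```

The probabilities themselves are unchanged by the threshold; only the decision rule applied on top of them moves.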
>
> If you use class_weight = 'balanced', the probabilities will no longer
> be well calibrated; however, you would predict some of those infrequent
> categories.
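
To illustrate the effect of class_weight = 'balanced' described above, here is a small sketch under the same kind of assumption: a synthetic dataset with about 5% positives, chosen only for illustration. It fits one unweighted and one balanced LogisticRegression and compares their mean predicted probability for the rare class with the observed frequency. The Brier scores are printed for comparison only, evaluated on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Synthetic, imbalanced data: roughly 5% positives (illustrative assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Predicted probability of the rare class under each model.
p_plain = plain.predict_proba(X)[:, 1]
p_balanced = balanced.predict_proba(X)[:, 1]

base_rate = y.mean()
print(f"base rate:            {base_rate:.3f}")
print(f"mean proba, plain:    {p_plain.mean():.3f}")
print(f"mean proba, balanced: {p_balanced.mean():.3f}")

# Brier score (lower is better), shown for comparison on the training data.
print(f"Brier, plain:    {brier_score_loss(y, p_plain):.4f}")
print(f"Brier, balanced: {brier_score_loss(y, p_balanced):.4f}")

# Number of samples assigned to the rare class at the default 0.5 threshold.
n_pred_plain = int(plain.predict(X).sum())
n_pred_balanced = int(balanced.predict(X).sum())
print(f"predicted positives, plain: {n_pred_plain}, balanced: {n_pred_balanced}")
```

The balanced model pushes the rare-class probabilities above the observed frequency, which is why it predicts that class more often while no longer reporting probabilities that match the data.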
>
> See discussions in
> https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.
>
> Roman
>
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn