[scikit-learn] biased predictions in logistic regression

Sean Violante sean.violante at gmail.com
Thu Dec 15 17:02:08 EST 2016


The problem is the (stupid!) liblinear solver, which also penalises the
intercept as part of the regularisation. Use a different solver, or
increase the intercept_scaling parameter so the intercept is penalised
less.
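
For example (a minimal sketch; the parameter values are illustrative,
not tuned):

    from sklearn.linear_model import LogisticRegression

    # lbfgs applies the L2 penalty to the coefficients only,
    # not to the intercept
    clf = LogisticRegression(C=1.0, solver='lbfgs')

    # or keep liblinear but enlarge intercept_scaling: liblinear adds a
    # synthetic constant feature for the intercept, and a larger scaling
    # value reduces the effective penalty on it
    clf = LogisticRegression(C=1.0, solver='liblinear',
                             intercept_scaling=100.0)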

On 15 Dec 2016 10:44 pm, "Sebastian Raschka" <se.raschka at gmail.com> wrote:

> Subtracting the median wouldn’t result in normalizing in the usual
> sense, since subtracting a constant just shifts the values by a
> constant. Instead, for logistic regression & most optimizers, I would
> recommend subtracting the mean to center the features at mean zero and
> dividing by the standard deviation to get “z” scores (e.g., this can be
> done with StandardScaler()).
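>
> A quick sketch of that (here X stands in for the feature matrix; it is
> not a name from the code below):
>
>     from sklearn.preprocessing import StandardScaler
>
>     scaler = StandardScaler()
>     X_scaled = scaler.fit_transform(X)  # per column: mean 0, std 1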
>
> Best,
> Sebastian
>
> > On Dec 15, 2016, at 4:02 PM, Rachel Melamed <melamed at uchicago.edu> wrote:
> >
> > I just tried it, and it did not appear to change the results at all.
> > I ran it as follows:
> >
> > 1) Normalize the dummy variables (by subtracting the median) to make
> > a matrix of about 10000 x 5.
> >
> > 2) For each of the 1000 output variables:
> >
> > a. Each output variable uses the same dummy variables, but not all
> > settings of the covariates are observed for every output variable. So
> > I create the design matrix per output variable using patsy, including
> > pairwise interactions (see the patsy sketch after the code below).
> > This gives an around 10000 x 350 design matrix, plus a matrix I call
> > “success_fail” that holds, for each setting, the number of successes
> > and the number of failures, so it is of size 10000 x 2.
> >
> > b. Run the regression using:
> >
> > import numpy as np
> > from sklearn import linear_model
> >
> > # Stack two copies of the design matrix: one copy for the success
> > # rows and one for the failure rows
> > skdesign = np.vstack((design, design))
> >
> > # Labels: 1 for the success copy, 0 for the failure copy
> > sklabel = np.hstack((np.ones(success_fail.shape[0]),
> >                      np.zeros(success_fail.shape[0])))
> >
> > # Weight each row by its observed count
> > skweight = np.hstack((success_fail['success'], success_fail['fail']))
> >
> > logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
> >                                           fit_intercept=False)
> > logregN.fit(skdesign, sklabel, sample_weight=skweight)
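> >
> > For reference, a hedged sketch of the patsy step in 2a (the column
> > names x1..x5 and the dataframe df are made up for illustration):
> >
> > from patsy import dmatrix
> >
> > # "(...)**2" expands to main effects plus all pairwise interactions
> > design = dmatrix("(x1 + x2 + x3 + x4 + x5)**2", df,
> >                  return_type='dataframe')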
> >
> >
> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
> >>
> >> Could you try normalizing the dataset after the dummy encoding and
> >> see whether the behavior is still reproducible?
> >>
> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
> >> Thanks for the reply.  The covariates (“X”) are all dummy/categorical
> >> variables.  So I guess no, nothing is normalized.
> >>
> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
> >>>
> >>> Hi Rachel,
> >>>
> >>> Do you have your data normalized?
> >>>
> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
> >>> Hi all,
> >>> Does anyone have any suggestions for this problem:
> >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
> >>>
> >>> I am running around 1000 similar logistic regressions, with the same
> >>> covariates but slightly different data and response variables. All
> >>> of my response variables have sparse successes (p(success) < 0.05,
> >>> usually).
> >>>
> >>> I noticed that with the regularized regression, the results are
> >>> consistently biased to predict more "successes" than are observed in
> >>> the training data. When I relax the regularization, this bias goes
> >>> away. The bias is unacceptable for my use case, but the
> >>> more-regularized model does seem a bit better.
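> >>>
> >>> A minimal sketch of this comparison (reusing skdesign, sklabel, and
> >>> skweight from the code above; the C values are illustrative):
> >>>
> >>> # Weighted mean predicted success probability vs. the observed rate
> >>> observed = skweight[sklabel == 1].sum() / skweight.sum()
> >>> for C in (1.0, 1e4):
> >>>     clf = linear_model.LogisticRegression(C=C, solver='lbfgs',
> >>>                                           fit_intercept=False)
> >>>     clf.fit(skdesign, sklabel, sample_weight=skweight)
> >>>     p = clf.predict_proba(skdesign)[:, 1]
> >>>     print(C, np.average(p, weights=skweight), observed)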
> >>>
> >>> Below, I plot the results of the 1000 different regressions for two
> >>> different values of C:
> >>>
> >>> I also looked at the parameter estimates for one of these
> >>> regressions: in the plot below, each point is one parameter. It
> >>> seems like the intercept (the point on the bottom left) is too high
> >>> for the C=1 model.
> >>>
> >>>
> >>>
> >>> --
> >>> Yours sincerely,
> >>> Alexey A. Dral
> >>
> >> --
> >> Yours sincerely,
> >> Alexey A. Dral