[scikit-learn] biased predictions in logistic regression

Stuart Reynolds stuart at stuartreynolds.net
Fri Dec 16 00:30:42 EST 2016

Sorry... I mean penalized likelihood, not large weight penalization.

Here's the reference I was thinking of

On Thu, Dec 15, 2016 at 9:12 PM <josef.pktd at gmail.com> wrote:

> just some generic comments, I don't have any experience with penalized
> estimation nor did I go through the math.
>
> In unregularized Logistic Regression or Logit, and in several other models,
> the estimator satisfies some aggregation properties, so that the in-sample
> (training set) predicted proportions match the observed proportions of
> the sample.
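A short sketch of where the aggregation property mentioned above comes from. The notation (intercept \beta_0, coefficients \beta, fitted probabilities \hat{p}_i) is introduced here for illustration and is not taken from the thread.

```latex
\ell(\beta_0, \beta) = \sum_i \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right],
\qquad
\hat{p}_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}

\frac{\partial \ell}{\partial \beta_0} = \sum_i (y_i - \hat{p}_i) = 0
\quad\Longrightarrow\quad
\sum_i \hat{p}_i = \sum_i y_i
```

At the unpenalized maximum with an intercept in the model, the predicted number of successes in the training sample therefore equals the observed number. If a penalty that involves \beta_0 is added to the objective, this first-order condition gains an extra term and the equality need not hold.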
>
> Regularized estimation does not require unbiased estimation of the
> parameters because it maximizes a different objective function, like mean
> squared error in the linear model. We are trading off bias against
> variance. I think this will propagate to the prediction, but I'm not sure
> whether an unpenalized intercept can be made to compensate for the bias in
> the average prediction.
>
> For Logit this would mean that although we have a bias, we have less
> variance/variation in the prediction, so overall we are doing better than
> with unregularized prediction under the chosen penalization measure.
> I assume because the regularization biases towards zero coefficients it
> also biases towards a prediction of 0.5, unless it's compensated for by the
> intercept.
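One way to probe this on synthetic data is sketched below. Everything in it (the simulated dummies, the true coefficients, the tight solver tolerance) is an assumption made for illustration, not the data from this thread. It fits an L2-penalized LogisticRegression twice with solver='lbfgs': once with the intercept supplied as a column of ones and fit_intercept=False, so the intercept is shrunk like any other coefficient, and once with fit_intercept=True, where the lbfgs solver leaves the intercept out of the penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic rare-event data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 5000
X = rng.integers(0, 2, size=(n, 5)).astype(float)  # 0/1 dummy covariates
true_coef = np.array([0.5, -0.5, 0.3, 0.0, 0.2])
p_true = 1.0 / (1.0 + np.exp(-(-4.0 + X @ true_coef)))
y = (rng.random(n) < p_true).astype(int)
observed = y.sum()

# Tight tolerance so the first-order conditions hold closely at the solution.
settings = dict(C=1.0, solver="lbfgs", tol=1e-8, max_iter=1000)

# (a) Intercept as an explicit column of ones: it is penalized with the rest.
X_ones = np.hstack([np.ones((n, 1)), X])
penalized = LogisticRegression(fit_intercept=False, **settings).fit(X_ones, y)
pred_penalized = penalized.predict_proba(X_ones)[:, 1].sum()

# (b) Intercept fitted by the estimator: lbfgs does not penalize it.
unpenalized = LogisticRegression(fit_intercept=True, **settings).fit(X, y)
pred_unpenalized = unpenalized.predict_proba(X)[:, 1].sum()

print("observed successes:            ", observed)
print("predicted, penalized intercept:", pred_penalized)
print("predicted, free intercept:     ", pred_unpenalized)
```

Under these assumptions, fit (a) predicts more in-sample successes than were observed, while fit (b) matches the observed count up to solver tolerance. Whether the same mechanism explains the results reported in this thread depends on whether the design matrix there contains a constant column.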
>
> I didn't read the King and Zeng (2001) article, but it doesn't mention
> penalization or regularization, based on a brief search, so it doesn't seem
> to address the regularization bias. (Aside, from the literature I think
> many people use a different model than logistic for rare events data,
> either Poisson with exponential link or Binomial/Bernoulli with an
>
> I think demeaning could help because it reduces the dependence between
> the intercept and the other penalized variables, but because of the
> nonlinear model it will not make it orthogonal.
>
> The question is whether it's possible to improve the estimator by
> additionally adjusting the mean or the threshold for 0-1 predictions. It
> might depend on the criteria to choose the penalization. I don't know and
> have no idea what scikit-learn does.
>
> Josef
>
> On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds <
> stuart at stuartreynolds.net> wrote:
>
> Here's a discussion
>
> http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression
>
> See the King and Zeng reference.
> It would be nice to have these methods in scikit.
>
> On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed <melamed at uchicago.edu>
> wrote:
>
> Stuart,
>
> Yes, the data is quite imbalanced (this is what I meant by
> p(success) < .05).
>
> To be clear, I calculate
>
> \sum_i \hat{y_i}
> = (logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1))).sum()
>
> and compare that number to the observed number of success. I find the
> predicted number to always be higher (I think, because of the intercept).
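The comparison described above can be sketched as a small helper. The function name and the toy numbers are hypothetical; only the idea of weighting each setting's predicted probability by its number of trials and summing comes from the message.

```python
import numpy as np

def predicted_vs_observed(p_success, n_success, n_fail):
    """Return (predicted, observed) total successes for grouped binomial data.

    p_success: predicted success probability for each covariate setting
    n_success, n_fail: observed success / failure counts for each setting
    """
    trials = n_success + n_fail
    predicted = float(np.sum(p_success * trials))
    observed = float(np.sum(n_success))
    return predicted, observed

# Toy numbers, made up for illustration.
p = np.array([0.02, 0.05, 0.10])
succ = np.array([1, 3, 4])
fail = np.array([99, 47, 26])

predicted, observed = predicted_vs_observed(p, succ, fail)
print(predicted, observed)
```

A predicted total that sits consistently above the observed total across many fits is the pattern reported in this thread.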
>
> I was not aware of a bias for imbalanced data. Can you tell me more? Why
> does it not appear with the relaxed regularization? Also, using the same
> data with statsmodels LR, which has no regularization, this doesn't seem to
> be a problem. Any suggestions for how I could fix this are welcome.
>
> Thank you
>
> On Dec 15, 2016, at 4:41 PM, Stuart Reynolds <stuart at stuartreynolds.net>
> wrote:
>
> LR is biased with imbalanced datasets. Is your dataset unbalanced (e.g.
> is there one class that has a much smaller prevalence in the data than the
> other)?
>
> On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <melamed at uchicago.edu>
> wrote:
>
> I just tried it and it did not appear to change the results at all?
>
> I ran it as follows:
>
> 1) Normalize dummy variables (by subtracting median) to make a matrix of
>
> 2) For each of the 1000 output variables:
>
> a. Each output variable uses the same dummy variables, but not all
> settings of covariates are observed for all output variables. So I create
> the design matrix using patsy per output variable to include pairwise
> interactions. Then I have a design matrix of around 10000 x 350, and a
> matrix I call “success_fail” that has, for each setting, the number of
> successes and the number of failures, so it is of size 10000 x 2.
>
> b. Run regression using:
>
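> import numpy as np
> from sklearn import linear_model
>
> # Stack the design twice: first copy labeled 1, second copy labeled 0.
> skdesign = np.vstack((design, design))
> sklabel = np.hstack((np.ones(success_fail.shape[0]),
>                      np.zeros(success_fail.shape[0])))
> # Success counts weight the 1-rows, failure counts weight the 0-rows.
> skweight = np.hstack((success_fail['success'], success_fail['fail']))
>
> # With fit_intercept=False, every column of the design (including a
> # constant column, if present) is penalized.
> logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
>                                           fit_intercept=False)
> logregN.fit(skdesign, sklabel, sample_weight=skweight)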
>
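To illustrate the encoding in step (b): passing success and failure counts as sample_weight on a doubled design should give the same fit as repeating each row once per observation. The sketch below checks this on a made-up 4-setting example. The counts, the tolerance settings, and the use of the default fit_intercept=True are assumptions for illustration, not the setup from this thread.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up grouped data: 4 covariate settings, 2 dummy columns.
design = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
success = np.array([2, 5, 1, 8])
fail = np.array([98, 95, 49, 42])

# Weighted encoding: each setting appears once as a 1-row and once as a 0-row.
skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(len(design)), np.zeros(len(design))))
skweight = np.hstack((success, fail))

settings = dict(C=1.0, solver="lbfgs", tol=1e-8, max_iter=1000)
weighted = LogisticRegression(**settings).fit(
    skdesign, sklabel, sample_weight=skweight
)

# Expanded encoding: one row per individual observation.
X_rep = np.repeat(skdesign, skweight, axis=0)
y_rep = np.repeat(sklabel, skweight)
expanded = LogisticRegression(**settings).fit(X_rep, y_rep)

print(weighted.coef_, weighted.intercept_)
print(expanded.coef_, expanded.intercept_)
```

If the two fits agree, the weighted encoding itself is unlikely to be the source of any discrepancy between predicted and observed counts.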
>
> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
>
> Could you try normalizing the dataset after dummy-encoding the features
> and see whether the behavior is reproducible?
>
> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>
> Thanks for the reply.  The covariates (“X”) are all dummy/categorical
> variables.  So I guess no, nothing is normalized.
>
> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
>
> Hi Rachel,
>
> Do you have your data normalized?
>
> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>
> Hi all,
>
> Does anyone have any suggestions for this problem:
>
> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
>
> I am running around 1000 similar logistic regressions, with the same
> covariates but slightly different data and response variables. All of my
> response variables have sparse successes (p(success) < .05 usually).
>
> I noticed that with the regularized regression, the results are
> consistently biased to predict more "successes" than are observed in the
> training data. When I relax the regularization, this bias goes away. The
> bias observed is unacceptable for my use case, but the more-regularized
> model does seem a bit better.
>
> Below, I plot the results for the 1000 different regressions for 2
> different values of C: [image: results for the different regressions for
> 2 different values of C] <https://i.stack.imgur.com/1cbrC.png>
>
> I looked at the parameter estimates for one of these regressions: below
> each point is one parameter. It seems like the intercept (the point on the
> bottom left) is too high for the C=1 model. [image: enter image
> description here] <https://i.stack.imgur.com/NTFOY.png>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> --
> Yours sincerely,
> Alexey A. Dral