[scikit-learn] biased predictions in logistic regression

Sebastian Raschka se.raschka at gmail.com
Thu Dec 15 16:43:35 EST 2016


Subtracting the median wouldn’t normalize the features in the usual sense, since subtracting a constant only shifts the values. Instead, for logistic regression and most optimizers, I would recommend subtracting the mean to center each feature at zero and then dividing by the standard deviation to get “z” scores (e.g., this can be done with StandardScaler()).
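
As a rough illustration of the difference (a sketch on a small made-up array, not the data from this thread): subtracting the median only shifts each column, while StandardScaler() also rescales each column to unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made up for illustration): two columns on different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [6.0, 900.0]])

# Subtracting the median shifts each column but leaves its spread unchanged.
X_shifted = X - np.median(X, axis=0)

# StandardScaler subtracts the column mean and divides by the column
# standard deviation, giving z-scores with mean 0 and unit variance.
X_z = StandardScaler().fit_transform(X)

print("spread before / after median shift:", X.std(axis=0), X_shifted.std(axis=0))
print("z-score mean / std:", X_z.mean(axis=0), X_z.std(axis=0))
```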

Best,
Sebastian

> On Dec 15, 2016, at 4:02 PM, Rachel Melamed <melamed at uchicago.edu> wrote:
> 
> I just tried it and it did not appear to change the results at all?
> I ran it as follows:
> 1) Normalize the dummy variables (by subtracting the median) to make a matrix of about 10000 x 5
> 
> 2) For each of the 1000 output variables:
> a. Each output variable uses the same dummy variables, but not all settings of the covariates are observed for every output variable. So I create the design matrix with patsy per output variable, including pairwise interactions. This gives a design matrix of around 10000 x 350 and a matrix I call “success_fail” that holds, for each setting, the number of successes and the number of failures, so it is of size 10000 x 2.
> 
> b. Run regression using:
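>     import numpy as np
>     from sklearn import linear_model
>
>     # Stack the design matrix twice: first copy labeled 1 (success), second labeled 0 (fail).
>     skdesign = np.vstack((design, design))
>     sklabel = np.hstack((np.ones(success_fail.shape[0]),
>                          np.zeros(success_fail.shape[0])))
>     # Weight each row by its observed success or fail count.
>     skweight = np.hstack((success_fail['success'], success_fail['fail']))
>
>     logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
>                                               fit_intercept=False)
>     logregN.fit(skdesign, sklabel, sample_weight=skweight)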
> 
> 
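For readers following the stacking trick quoted above: weighting each stacked row by its count should be equivalent to repeating that row count-many times. The sketch below checks this on made-up numbers; `design`, `success`, and `fail` here are hypothetical stand-ins, not the data from this thread.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins: 6 covariate settings, an explicit intercept
# column plus two dummy columns, with made-up success/fail counts.
design = np.array([[1, 0, 0],
                   [1, 0, 1],
                   [1, 1, 0],
                   [1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 0]], dtype=float)
success = np.array([1, 0, 2, 1, 0, 3])
fail = np.array([20, 15, 30, 25, 10, 40])

# Weighted form, as in the quoted message: one row per (setting, outcome).
X_w = np.vstack((design, design))
y_w = np.hstack((np.ones(len(design)), np.zeros(len(design))))
w = np.hstack((success, fail))

# Expanded form: repeat each row as many times as its count.
X_rep = np.repeat(X_w, w, axis=0)
y_rep = np.repeat(y_w, w)

kwargs = dict(C=1, solver='lbfgs', fit_intercept=False, tol=1e-10, max_iter=10000)
fit_weighted = LogisticRegression(**kwargs).fit(X_w, y_w, sample_weight=w)
fit_expanded = LogisticRegression(**kwargs).fit(X_rep, y_rep)

print(fit_weighted.coef_)
print(fit_expanded.coef_)
```

The two coefficient vectors should agree up to optimizer tolerance, so the weighted form is just a compact way of fitting the same model.
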
>> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
>> 
>> Could you try normalizing the dataset after dummy-encoding the features and see whether the behavior is still reproducible?
>> 
>> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>> Thanks for the reply. The covariates (“X”) are all dummy/categorical variables. So I guess no, nothing is normalized.
>> 
>>> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
>>> 
>>> Hi Rachel,
>>> 
>>> Do you have your data normalized?
>>> 
>>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>>> Hi all,
>>> Does anyone have any suggestions for this problem:
>>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
>>> 
>>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).
>>> 
>>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better.
>>> 
>>> Below, I plot the results for the 1000 different regressions for 2 different values of C: 
>>> 
>>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. 
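One possible contributor to this pattern, which the thread itself does not confirm: if the intercept enters the model as a column of ones in the design matrix (as patsy adds by default) and fit_intercept=False is used, the L2 penalty also shrinks the intercept toward zero, which pulls predicted probabilities toward 0.5 and so over-predicts rare successes. The sketch below uses simulated data (all names and numbers are made up) to compare a penalized intercept column against fit_intercept=True, where scikit-learn leaves the intercept unpenalized.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 500

# Made-up dummy covariates and a rare outcome.
x = rng.randint(0, 2, size=(n, 3)).astype(float)
true_logit = -3.0 + 0.5 * x[:, 0]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

kwargs = dict(C=1, solver='lbfgs', tol=1e-10, max_iter=10000)

# (a) Intercept as an explicit column of ones: it is penalized like any other coefficient.
X_const = np.column_stack([np.ones(n), x])
penalized = LogisticRegression(fit_intercept=False, **kwargs).fit(X_const, y)

# (b) Intercept handled by scikit-learn: it is excluded from the penalty.
unpenalized = LogisticRegression(fit_intercept=True, **kwargs).fit(x, y)

observed = y.mean()
pred_pen = penalized.predict_proba(X_const)[:, 1].mean()
pred_unpen = unpenalized.predict_proba(x)[:, 1].mean()

print("observed success rate:         ", observed)
print("mean prediction, penalized:    ", pred_pen)
print("mean prediction, unpenalized:  ", pred_unpen)
```

With an unpenalized intercept the mean predicted probability on the training data should match the observed rate, while the penalized version should sit above it; the gap shrinks as C or the total sample weight grows.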
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Yours sincerely,
>>> Alexey A. Dral


