[scikit-learn] biased predictions in logistic regression
Stuart Reynolds
stuart at stuartreynolds.net
Thu Dec 15 16:41:32 EST 2016
LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. is
there one class that has a much smaller prevalence in the data that the
other)?
On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <melamed at uchicago.edu>
wrote:
> I just tried it and it did not appear to change the results at all?
> I ran it as follows:
> 1) Normalize dummy variables (by subtracting median) to make a matrix of
> about 10000 x 5
>
> 2) For each of the 1000 output variables:
> a. Each output variable uses the same dummy variables, but not all
> settings of covariates are observed for all output variables. So I create
> the design matrix using patsy per output variable to include pairwise
> interactions. Then, I have an around 10000 x 350 design matrix , and a
> matrix I call “success_fail” that has for each setting the number of
> success and number of fail, so it is of size 10000 x 2
>
> b. Run regression using:
> skdesign = np.vstack((design,design))
> sklabel = np.hstack((np.ones(success_fail.shape[0]),
> np.zeros(success_fail.shape[0])))
> skweight = np.hstack((success_fail['success'], success_fail['fail']))
>
> logregN = linear_model.LogisticRegression(C=1,
> solver= 'lbfgs',fit_intercept=False)
> logregN.fit(skdesign, sklabel, sample_weight=skweight)
>
>
> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
>
> Could you try to normalize dataset after feature dummy encoding and see if
> it is reproducible behavior?
>
> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>
>> Thanks for the reply. The covariates (“X") are all dummy/categorical
>> variables. So I guess no, nothing is normalized.
>>
>> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
>>
>> Hi Rachel,
>>
>> Do you have your data normalized?
>>
>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>>
>>> Hi all,
>>> Does anyone have any suggestions for this problem:
>>> http://stackoverflow.com/questions/41125342/sklearn-logistic
>>> -regression-gives-biased-results
>>>
>>> I am running around 1000 similar logistic regressions, with the same
>>> covariates but slightly different data and response variables. All of my
>>> response variables have a sparse successes (p(success) < .05 usually).
>>>
>>> I noticed that with the regularized regression, the results are
>>> consistently biased to predict more "successes" than is observed in the
>>> training data. When I relax the regularization, this bias goes away. The
>>> bias observed is unacceptable for my use case, but the more-regularized
>>> model does seem a bit better.
>>>
>>> Below, I plot the results for the 1000 different regressions for 2
>>> different values of C: [image: results for the different regressions
>>> for 2 different values of C] <https://i.stack.imgur.com/1cbrC.png>
>>>
>>> I looked at the parameter estimates for one of these regressions: below
>>> each point is one parameter. It seems like the intercept (the point on the
>>> bottom left) is too high for the C=1 model. [image: enter image
>>> description here] <https://i.stack.imgur.com/NTFOY.png>
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>>
>> --
>> Yours sincerely,
>> Alexey A. Dral
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
> Yours sincerely,
> Alexey A. Dral
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20161215/b3892439/attachment.html>
More information about the scikit-learn
mailing list