<div dir="ltr">LR can be biased on imbalanced datasets. Is your dataset imbalanced (e.g. does one class have a much lower prevalence in the data than the other)?</div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <span dir="ltr"><<a href="mailto:melamed@uchicago.edu" target="_blank">melamed@uchicago.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word">

I just tried it, and it did not appear to change the results at all.

<div>I ran it as follows:

<div>1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 </div>

<div><br>

</div>

<div>2) For each of the 1000 output variables:</div>

<div>a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. This gives a design matrix of around 10000 x 350, and a matrix I call “success_fail” that holds, for each setting, the number of successes and the number of failures, so it is of size 10000 x 2.</div>

<div><br>

</div>

<div>b. Run regression using:</div>

<div>

<div><font face="Courier">import numpy as np</font></div>

<div><font face="Courier"># Stack the design twice: the first copy is labelled 1 (success), the second 0 (fail)</font></div>

<div><font face="Courier">skdesign = np.vstack((design, design))</font></div>

<div><font face="Courier">sklabel = np.hstack((np.ones(success_fail.shape[0]),</font></div>

<div><font face="Courier">                     np.zeros(success_fail.shape[0])))</font></div>

<div><font face="Courier"># Weight each stacked row by its observed success or fail count</font></div>

<div><font face="Courier">skweight = np.hstack((success_fail['success'], success_fail['fail']))</font></div>

<div><font face="Courier"><br>

</font></div>

<div>

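<div><font face="Courier">from sklearn import linear_model</font></div>

<div><font face="Courier"># fit_intercept=False: the patsy design matrix is assumed to carry its own intercept column</font></div>

<div><font face="Courier">logregN = linear_model.LogisticRegression(C=1,</font></div>

<div><font face="Courier">                                    solver='lbfgs', fit_intercept=False)</font></div>

<div><font face="Courier">logregN.fit(skdesign, sklabel, sample_weight=skweight)</font></div>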
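<div>For anyone who wants to reproduce the setup without the real data, here is a minimal, self-contained sketch of the stacking-and-weighting approach described above. The data are synthetic and purely an assumption for illustration: the matrix sizes, the 3% success rate, and the names n_settings, trials, success and fail are hypothetical, not taken from the actual analysis.</div>

<pre>
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the patsy design matrix: an explicit intercept
# column plus a few 0/1 dummy columns (the real one is about 10000 x 350).
n_settings = 200
dummies = rng.integers(0, 2, size=(n_settings, 4)).astype(float)
design = np.column_stack([np.ones(n_settings), dummies])

# Hypothetical success/fail counts per covariate setting, with rare successes.
trials = rng.integers(50, 200, size=n_settings)
success = rng.binomial(trials, 0.03)
fail = trials - success

# Each setting appears twice: once labelled 1 and weighted by its success
# count, once labelled 0 and weighted by its failure count.
skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(n_settings), np.zeros(n_settings)))
skweight = np.hstack((success, fail)).astype(float)

# fit_intercept=False because the design matrix already has an intercept
# column; that column is then penalized like any other coefficient.
logreg = LogisticRegression(C=1, solver="lbfgs", fit_intercept=False, max_iter=1000)
logreg.fit(skdesign, sklabel, sample_weight=skweight)

# Compare the trial-weighted mean predicted probability with the observed rate.
predicted_rate = np.average(logreg.predict_proba(design)[:, 1], weights=trials)
observed_rate = success.sum() / trials.sum()
print(predicted_rate, observed_rate)
```
</pre>

<div>Printing the two rates side by side is one simple way to check whether a fitted model over-predicts successes overall.</div>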

</div><div><div class="h5">

<div><br>

</div>

<div><br>

</div>

<div>

<blockquote type="cite">

<div>On Dec 15, 2016, at 2:16 PM, Alexey Dral <<a href="mailto:aadral@gmail.com" target="_blank">aadral@gmail.com</a>> wrote:</div>

<br class="m_3305824014180486996Apple-interchange-newline">

<div>

<div dir="ltr">Could you try normalizing the dataset after dummy-encoding the features and see whether the behavior is reproducible?

<div class="gmail_extra"><br>

<div class="gmail_quote">2016-12-15 22:03 GMT+03:00 Rachel Melamed <span dir="ltr">

<<a href="mailto:melamed@uchicago.edu" target="_blank">melamed@uchicago.edu</a>></span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word">Thanks for the reply. The covariates (“X”) are all dummy/categorical variables. So I guess no, nothing is normalized.

<div>

<div class="m_3305824014180486996h5">

<div>

<div><br>

<div>

<blockquote type="cite">

<div>On Dec 15, 2016, at 1:54 PM, Alexey Dral <<a href="mailto:aadral@gmail.com" target="_blank">aadral@gmail.com</a>> wrote:</div>

<br class="m_3305824014180486996m_8515338733787080180Apple-interchange-newline">

<div>

<div dir="ltr">Hi Rachel,

<div><br>

</div>

<div>Do you have your data normalized?<br>

<div class="gmail_extra"><br>

<div class="gmail_quote">2016-12-15 20:21 GMT+03:00 Rachel Melamed <span dir="ltr">

<<a href="mailto:melamed@uchicago.edu" target="_blank">melamed@uchicago.edu</a>></span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word">

<div>Hi all,</div>

<div>Does anyone have any suggestions for this problem:</div>

<a href="http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results" target="_blank">http://stackoverflow.com/quest<wbr>ions/41125342/sklearn-logistic<wbr>-regression-gives-biased-resul<wbr>ts</a>

<div><br>

</div>

<div>

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:15px;clear:both;color:rgb(36,39,41);font-family:Arial,'Helvetica Neue',Helvetica,sans-serif;background-color:rgb(255,255,255)">

I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).</p>

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:15px;clear:both;color:rgb(36,39,41);font-family:Arial,'Helvetica Neue',Helvetica,sans-serif;background-color:rgb(255,255,255)">

I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than are observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better.</p>
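<div>The bias described above can be measured as the mean predicted P(success) minus the observed success rate, computed for several values of C. The sketch below does this on synthetic data; the data, the coefficient values in true_coef, and the helper name prediction_gap are all hypothetical and only meant to illustrate the check, not to reproduce the actual 1000 regressions.</div>

<pre>
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: an intercept column plus 0/1 dummy covariates, with
# rare successes (the true intercept of -3.5 gives a low base rate).
n = 5000
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=(n, 5))]).astype(float)
true_coef = np.array([-3.5, 0.4, -0.3, 0.2, 0.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_coef)))


def prediction_gap(C):
    """Mean predicted P(success) minus the observed success rate."""
    # The intercept lives in X, so it is penalized along with the other terms.
    model = LogisticRegression(C=C, solver="lbfgs", fit_intercept=False, max_iter=5000)
    model.fit(X, y)
    return model.predict_proba(X)[:, 1].mean() - y.mean()


# Small C means strong regularization; large C means weak regularization.
gaps = {C: prediction_gap(C) for C in (0.01, 1.0, 1e4)}
for C, gap in gaps.items():
    print(f"C={C:g}: predicted minus observed rate = {gap:+.5f}")
```
</pre>

<div>A positive gap means the model predicts more successes on average than were observed. In this sketch the intercept is part of the penalized design matrix, which is one possible source of such a gap that may be worth ruling out.</div>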

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:15px;clear:both;color:rgb(36,39,41);font-family:Arial,'Helvetica Neue',Helvetica,sans-serif;background-color:rgb(255,255,255)">

Below, I plot the results for the 1000 different regressions for 2 different values of C: <a href="https://i.stack.imgur.com/1cbrC.png" rel="nofollow noreferrer" style="margin:0px;padding:0px;border:0px;color:rgb(0,89,153);text-decoration:none" target="_blank"><img src="https://i.stack.imgur.com/1cbrC.png" alt="results for the different regressions for 2 different values of C" style="margin:0px;padding:0px;border:0px;max-width:100%"></a></p>

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:15px;clear:both;color:rgb(36,39,41);font-family:Arial,'Helvetica Neue',Helvetica,sans-serif;background-color:rgb(255,255,255)">

I looked at the parameter estimates for one of these regressions: in the plot below, each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. <a href="https://i.stack.imgur.com/NTFOY.png" rel="nofollow noreferrer" style="margin:0px;padding:0px;border:0px;color:rgb(0,89,153);text-decoration:none" target="_blank"><img src="https://i.stack.imgur.com/NTFOY.png" alt="parameter estimates for one of the regressions" style="margin:0px;padding:0px;border:0px;max-width:100%"></a></p>

<p style="margin:0px 0px 1em;padding:0px;border:0px;font-size:15px;clear:both;color:rgb(36,39,41);font-family:Arial,'Helvetica Neue',Helvetica,sans-serif;background-color:rgb(255,255,255)">

<br>

</p>

</div>

</div>

<br>

______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>

<br>

</blockquote>

</div>

<br>

<br clear="all">

<div><br>

</div>

-- <br>

<div class="m_3305824014180486996m_8515338733787080180gmail_signature" data-smartmail="gmail_signature">

<div dir="ltr">

<div>

<div dir="ltr">

<div dir="ltr">

<div>Yours sincerely,</div>

<div><span style="font-size:12.8px">Alexey A. Dral</span></div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>


</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</div>

</div>

<br>


<br>

</blockquote>

</div>

<br>

<br clear="all">

<div><br>

</div>

-- <br>

<div class="m_3305824014180486996gmail_signature" data-smartmail="gmail_signature">

<div dir="ltr">

<div>

<div dir="ltr">

<div dir="ltr">

<div>Yours sincerely,</div>

<div><span style="font-size:12.8px">Alexey A. Dral</span></div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>


</div>

</blockquote>

</div>

<br>

</div></div></div>

</div>

</div>


<br></blockquote></div><br></div>