<div dir="ltr"><div>You're right that you don't need CV to tune hyperparameters in ordinary least-squares linear regression, but you may want it for model evaluation.</div><div><br></div>As far as I understand, holding out a test set is recommended if you aren't entirely sure that the assumptions of the model hold (Gaussian error on a linear fit; independent and identically distributed samples). The model evaluation approach in predictive ML, using held-out data, relies only on the weaker assumption that the metric you have chosen, when applied to the test set you have held out, forms a reasonable measure of generalised / real-world performance. (Of course this too often fails to hold in practice, but it is the primary assumption, in my opinion, that ML practitioners need to be careful about.)</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 2 Jun 2019 at 12:43, C W &lt;<a href="mailto:tmrsg11@gmail.com" target="_blank">tmrsg11@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Nicholas,</div><div><br></div><div>I don't get it. <br></div><div><br></div><div>The coefficients are estimated through OLS. 
Essentially, you are just calculating a matrix pseudoinverse, where</div><div>beta = (X^T * X)^(-1) * X^T * y<br></div><div><br></div><div>Splitting the data does not improve the model. It only helps in something like LASSO, where you have a tuning parameter.<br></div><div><br></div><div>Holding out some data will make the regression estimates worse.</div><div><br></div><div>Hope to hear from you, thanks!<br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jun 1, 2019 at 10:04 AM Nicolas Hug &lt;<a href="mailto:niourf@gmail.com" target="_blank">niourf@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Splitting the data into train and test data is needed with any
machine learning model (not just linear regression with or without
least squares).
<p>The idea is that you want to evaluate the performance of your
model (prediction + scoring) on a portion of the data that you did
not use for training.</p>
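<p>A minimal sketch of that idea, assuming synthetic data (the sample sizes, the 30 features and the seed are illustrative choices, not from this thread): fit ordinary least squares on a training split, then score it on both the training and the held-out samples.</p>
<pre>
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only the first of
# 30 features carries signal, the rest are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 2.0 * X[:, 0] + rng.normal(size=60)

# Hold out 25% of the samples; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R^2. The training score tends to be optimistic because
# the coefficients were chosen to fit exactly these samples.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
```
</pre>
<p>The split does not change how the coefficients are estimated; it only provides samples on which the fitted model can be scored without reusing the data it was fit to.</p>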
<p>You'll find more details in the user guide
<a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-freetext" href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_blank">https://scikit-learn.org/stable/modules/cross_validation.html</a><br>
</p>
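<p>As a rough illustration of what that guide covers, here is a sketch that assumes a synthetic problem from <code>make_regression</code> and 5 folds (both are arbitrary choices): <code>cross_val_score</code> repeats the fit-then-score step once per fold, so every sample is used for evaluation exactly once.</p>
<pre>
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold CV: each fold is held out once while the model is fit on the
# other four, giving five out-of-sample R^2 scores.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores.round(3))
print("mean R^2:", round(scores.mean(), 3))
```
</pre>
<p>No hyperparameter is being tuned here; the folds are used purely to estimate out-of-sample performance.</p>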
<p>Nicolas</p>
<p><br>
</p>
<div class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-cite-prefix">On 5/31/19 8:54 PM, C W wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello everyone,
<div><br>
</div>
<div>I'm new to scikit-learn. I see that many tutorials in
scikit-learn follow a workflow along the lines of</div>
<div>1) transform the data</div>
<div>2) split the data: train, test</div>
<div>3) instantiate the sklearn object and fit</div>
<div>4) predict and tune parameters</div>
<div><br>
</div>
<div>But linear regression is fit by least squares, so I don't
think a train/test split is necessary. So, I guess I can just
use the entire dataset?</div>
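<div>For reference, the least-squares fit has the closed form beta = (X^T * X)^(-1) * X^T * y. A minimal sketch, assuming synthetic data and no intercept term (both are illustrative assumptions), checks that this closed form, the pseudoinverse and scikit-learn's <code>LinearRegression</code> give the same coefficients:</div>
<pre>
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Closed-form OLS: solve (X^T X) beta = X^T y rather than forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate via the Moore-Penrose pseudoinverse of X.
beta_pinv = np.linalg.pinv(X) @ y

# scikit-learn without an intercept, to match the formula as written.
beta_sklearn = LinearRegression(fit_intercept=False).fit(X, y).coef_

print(beta_normal)
print(beta_pinv)
print(beta_sklearn)
```
</pre>
<div>The closed form describes how the coefficients are computed from whatever samples are passed in. It says nothing about how well they predict new samples, which is the question a held-out set addresses.</div>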
<div><br>
</div>
<div>Thanks in advance!</div>
</div>
<br>
<pre class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-quote-pre">_______________________________________________
scikit-learn mailing list
<a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a>
<a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a>
</pre>
</blockquote>
</div>
</blockquote></div>
</blockquote></div>