<div dir="ltr"><div>You're right that you don't need CV for hyperparameter estimation in linear regression, but you may still want it for model evaluation.</div><div><br></div>As far as I understand, holding out a test set is recommended if you aren't entirely sure that the model's assumptions hold (Gaussian error on a linear fit; independent and identically distributed samples). Model evaluation in predictive ML, using held-out data, relies only on the weaker assumption that the metric you have chosen, when applied to the test set you have held out, is a reasonable measure of generalised / real-world performance. (Of course this too often fails to hold in practice, but it is the primary assumption, in my opinion, that ML practitioners need to be careful about.)</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 2 Jun 2019 at 12:43, C W &lt;<a href="mailto:tmrsg11@gmail.com" target="_blank">tmrsg11@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Nicolas,</div><div><br></div><div>I don't get it. <br></div><div><br></div><div>The coefficients are estimated through OLS. 
Essentially, you are just calculating a matrix pseudoinverse, where</div><div>beta = (X^T * X)^(-1) * X^T * y<br></div><div><br></div><div>Splitting the data does not improve the model. It only helps in something like LASSO, where you have a tuning parameter.<br></div><div><br></div><div>Holding out some data will make the regression estimates worse.</div><div><br></div><div>Hope to hear from you, thanks!<br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jun 1, 2019 at 10:04 AM Nicolas Hug &lt;<a href="mailto:niourf@gmail.com" target="_blank">niourf@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF">

    Splitting the data into train and test data is needed with any

    machine learning model (not just linear regression with or without

    least squares).

    <p>The idea is that you want to evaluate the performance of your

      model (prediction + scoring) on a portion of the data that you did

      not use for training.</p>
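    <p>As a minimal sketch of this idea (the synthetic data, the 25% test
      fraction and the choice of R^2 as the metric are all assumptions made
      for illustration, not something taken from this thread), a held-out
      evaluation of a linear regression might look like this:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): y is linear in X plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Compare the score on the rows used for fitting with the score on held-out rows.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"R^2 on training data: {r2_train:.3f}")
print(f"R^2 on held-out data: {r2_test:.3f}")
```

    <p>The held-out score is the one that estimates performance on new data;
      the training score is computed on the same rows the coefficients were
      fitted to.</p>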

    <p>You'll find more details in the user guide

      <a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-freetext" href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_blank">https://scikit-learn.org/stable/modules/cross_validation.html</a><br>

    </p>

    <p>Nicolas</p>

    <p><br>

    </p>

    <div class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-cite-prefix">On 5/31/19 8:54 PM, C W wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">Hello everyone,

        <div><br>

        </div>

        <div>I'm new to scikit-learn. I see that many tutorials in
          scikit-learn follow a workflow along the lines of</div>

        <div>1) transform the data</div>

        <div>2) split the data: train, test</div>

        <div>3) instantiate the sklearn object and fit</div>

        <div>4) predict and tune parameters</div>

        <div><br>

        </div>

        <div>But linear regression is fit by least squares, so I don't
          think a train/test split is necessary. So I guess I can just
          use the entire dataset?</div>
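        <div>The least-squares fit referred to here can be checked
          numerically. The sketch below is only an illustration on synthetic
          data (the data, the variable names and the use of the normal
          equations with an explicit intercept column are assumptions); it
          compares a direct least-squares solution with the coefficients
          found by <code>LinearRegression</code>:</div>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): intercept 3.0, slopes 2.0 and -1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept becomes part of beta.
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

# The same fit through scikit-learn, which estimates the intercept separately.
model = LinearRegression().fit(X, y)
beta_sklearn = np.concatenate([[model.intercept_], model.coef_])

print(beta)
print(beta_sklearn)
print(np.allclose(beta, beta_sklearn))
```

        <div>There is no tuning parameter in this fit: the coefficients are
          determined entirely by the rows passed to <code>fit</code>.</div>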

        <div><br>

        </div>

        <div>Thanks in advance!</div>

      </div>

      <br>


      <pre class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-quote-pre">_______________________________________________

scikit-learn mailing list

<a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a>

<a class="gmail-m_2438479440282568678gmail-m_-6384108789249064627gmail-m_429205647118890919moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a>

</pre>

    </blockquote>

  </div>


</blockquote></div>


</blockquote></div>
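<div>A rough sketch of using CV for model evaluation rather than for tuning, as mentioned at the top of this reply. The synthetic data, the 5 shuffled folds and the R^2 metric are assumptions chosen for illustration:</div>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumed for illustration): y is linear in X plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Each fold is held out once while the model is fitted on the other four.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)
print(f"mean R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

<div>Nothing is tuned here; the per-fold scores only estimate how well the fitted model predicts rows it was not trained on, and how much that estimate varies between splits.</div>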