[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

Joel Nothman joel.nothman at gmail.com
Sun Jun 2 01:11:02 EDT 2019


You're right that you don't need to use CV for hyperparameter estimation in
linear regression, but you may want it for model evaluation.
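
A minimal sketch of using CV purely for evaluation, not tuning. The
synthetic data from make_regression, the 5-fold split and the R^2 metric
are illustrative assumptions of this sketch, not anything taken from the
thread. LinearRegression has no hyperparameter to select here, so the
fold scores serve only as an estimate of out-of-sample performance.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumed purely for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()

# Each fold fits OLS on 4/5 of the data and scores R^2 on the remaining 1/5.
# Nothing is tuned; the scores only estimate generalisation performance.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()

print("per-fold R^2:", scores)
print("mean R^2:", mean_r2)
```

Once you are happy with the estimate, you can still refit on the entire
dataset to get the final coefficients.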

As far as I understand: Holding out a test set is recommended if you aren't
entirely sure that the assumptions of the model hold (Gaussian error on
a linear fit; independent and identically distributed samples). The model
evaluation approach in predictive ML, using held-out data, relies only on
the weaker assumption that the metric you have chosen, when applied to the
test set you have held out, forms a reasonable measure of generalised /
real-world performance. (Of course this too often does not hold in practice,
but it is the primary assumption, in my opinion, that ML practitioners need
to be careful of.)
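
To make the held-out idea concrete, here is a small sketch under stated
assumptions: the data are synthetic and deliberately quadratic, so the
"linear fit" assumption is violated on purpose, and the 25% test size and
the choice of MSE and R^2 as metrics are arbitrary. The test-set scores do
not depend on the model's assumptions being true; they only measure how
well the fitted line predicts data it has not seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with a quadratic relationship (an assumption of this
# sketch), so a straight-line model is misspecified.
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

# Hold out 25% of the samples; they are never used for fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Ordinary least squares on the training portion only.
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out portion.
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)

print("held-out MSE:", test_mse)
print("held-out R^2:", test_r2)
```

A low held-out R^2 here signals that the linear model predicts poorly on
new data, whatever the coefficient estimates themselves look like.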

On Sun, 2 Jun 2019 at 12:43, C W <tmrsg11 at gmail.com> wrote:

> Hi Nicholas,
>
> I don't get it.
>
> The coefficients are estimated through OLS. Essentially, you are just
> calculating a matrix pseudoinverse, where
> beta = (X^T * X)^(-1) * X^T * y
>
> Splitting the data does not improve the model. It only helps with something
> like LASSO, where you have a tuning parameter.
>
> Holding out some data will make the regression estimates worse.
>
> Hope to hear from you, thanks!
>
>
>
> On Sat, Jun 1, 2019 at 10:04 AM Nicolas Hug <niourf at gmail.com> wrote:
>
>> Splitting the data into train and test data is needed with any machine
>> learning model (not just linear regression with or without least squares).
>>
>> The idea is that you want to evaluate the performance of your model
>> (prediction + scoring) on a portion of the data that you did not use for
>> training.
>>
>> You'll find more details in the user guide
>> https://scikit-learn.org/stable/modules/cross_validation.html
>>
>> Nicolas
>>
>>
>> On 5/31/19 8:54 PM, C W wrote:
>>
>> Hello everyone,
>>
>> I'm new to scikit-learn. I see that many tutorials in scikit-learn follow
>> a workflow along the lines of
>> 1) transform the data
>> 2) split the data: train, test
>> 3) instantiate the sklearn object and fit
>> 4) predict and tune parameters
>>
>> But linear regression is done by least squares, so I don't think a
>> train/test split is necessary. So, I guess I can just use the entire dataset?
>>
>> Thanks in advance!
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn