[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

C W tmrsg11 at gmail.com
Tue Jun 4 20:44:38 EDT 2019


Thank you all for the replies.

I agree that prediction accuracy is great for evaluating black-box ML
models, especially advanced models like neural networks, or not-so-black-box
models like LASSO, because they are NP-hard to solve.

Linear regression is not a black box. I view prediction accuracy as
overkill for interpretable models, especially when you can use R-squared,
coefficient significance, etc.

Prediction accuracy also does not tell you which feature is important.
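To show what I mean, here is a minimal sketch (the synthetic X and y are just
placeholders, and statsmodels appears only because scikit-learn's
LinearRegression does not report coefficient p-values):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # placeholder feature matrix
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)

# In-sample R-squared from scikit-learn
print("R^2:", LinearRegression().fit(X, y).score(X, y))

# Coefficient estimates, t-statistics, and p-values (statsmodels, not sklearn)
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())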

What do you guys think? Thank you!


On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller <t3kcit at gmail.com> wrote:

> This classical paper on statistical practices (Breiman's "two cultures")
> might be helpful to understand the different viewpoints:
>
> https://projecteuclid.org/euclid.ss/1009213726
>
>
> On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>
>> As far as I understand: Holding out a test set is recommended if you
>> aren't entirely sure that the assumptions of the model hold (Gaussian
>> error on a linear fit; independent and identically distributed samples).
>> The model evaluation approach in predictive ML, using held-out data, relies
>> only on the weaker assumption that the metric you have chosen, when applied
>> to the test set you have held out, forms a reasonable measure of
>> generalised / real-world performance. (Of course this too often does not
>> hold in practice, but it is the primary assumption, in my opinion, that ML
>> practitioners need to be careful of.)
>>
>
> Dear CW,
> As Joel has said, holding out a test set will help you evaluate the
> validity of model assumptions, and his last point (reasonable measure of
> generalised performance) is absolutely essential for understanding the
> capabilities and limitations of ML.
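
A minimal sketch of such a held-out evaluation in scikit-learn (the synthetic
data and the choice of R^2 / MSE as metrics are only illustrative, not
anything prescribed in this thread):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder features
y = X @ rng.normal(size=5) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The chosen metric, computed only on held-out data, is the estimate of
# generalised performance.
print("held-out R^2:", r2_score(y_test, y_pred))
print("held-out MSE:", mean_squared_error(y_test, y_pred))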
>
> To add to your checklist for interpreting ML papers properly, be cautious
> when interpreting reports of high performance obtained with 5/10-fold or
> Leave-One-Out cross-validation on large datasets, where "large" depends on
> the nature of the problem setting.
> Results are also highly dependent on the distributions of the underlying
> independent variables (e.g., 60000 datapoints all with near-identical
> distributions may yield phenomenal performance in cross-validation and be
> almost non-predictive in truly unknown/prospective situations).
> Even at 500 datapoints, if independent variable distributions look similar
> (with similar endpoints), then when each model is trained on 80% of that
> data, the remaining 20% will certainly be predictable, and repeating that
> five times will yield statistics that seem impressive.
>
> So, again, while problem context completely dictates ML experiment design,
> metric selection, and interpretation of outcome, my personal rule of thumb
> is to do no more than 2-fold cross-validation (50% train, 50% predict) when
> I have 100+ datapoints.
> Even more extreme, try using 33% for training and 67% for validation (or
> even 20/80).
> If your model still reports good statistics, then you can believe that the
> patterns in the training data extrapolate well to the ones in the external
> validation data.
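
In scikit-learn terms, that rule of thumb might look something like the sketch
below (the data are synthetic and the split ratios are simply the ones
suggested above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # placeholder data
y = X @ rng.normal(size=5) + rng.normal(scale=0.3, size=500)

# 2-fold cross-validation: train on 50%, predict the other 50%, twice
cv = KFold(n_splits=2, shuffle=True, random_state=0)
print("2-fold R^2:", cross_val_score(LinearRegression(), X, y, cv=cv))

# Or a single harsher split: 33% train / 67% validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.33,
                                            random_state=0)
print("33/67 R^2:", LinearRegression().fit(X_tr, y_tr).score(X_val, y_val))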
>
> Hope this helps,
> J.B.
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>