[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

Andreas Mueller t3kcit at gmail.com
Mon Jun 3 11:41:17 EDT 2019


This classic paper on statistical practice (Breiman's "two cultures")
might be helpful for understanding the different viewpoints:

https://projecteuclid.org/euclid.ss/1009213726


On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>
>     As far as I understand: Holding out a test set is recommended if
>     you aren't entirely sure that the assumptions of the model hold
>     (Gaussian error on a linear fit; independent and identically
>     distributed samples). The model evaluation approach in predictive
>     ML, using held-out data, relies only on the weaker assumption that
>     the metric you have chosen, when applied to the test set you have
>     held out, forms a reasonable measure of generalised / real-world
>     performance. (Of course this too often does not hold in practice,
>     but it is the primary assumption, in my opinion, that ML
>     practitioners need to be careful of.)
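
In scikit-learn terms, that held-out evaluation is just an ordinary
train/test split. A minimal sketch, with purely synthetic data standing
in for a real problem:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: 100 samples, 3 features,
# a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Hold out 25% of the samples; they are never seen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the held-out set is taken as a measure of generalised
# performance, relying only on the test set being representative
# of the data the model will actually face.
print("held-out R^2:", model.score(X_test, y_test))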
>
>
> Dear CW,
> As Joel has said, holding out a test set will help you evaluate the
> validity of model assumptions, and his last point (reasonable measure 
> of generalised performance) is absolutely essential for understanding 
> the capabilities and limitations of ML.
>
> To add to your checklist for interpreting ML papers properly, be
> cautious about reports of high performance obtained with 5/10-fold or
> Leave-One-Out cross-validation on large datasets, where "large"
> depends on the nature of the problem setting.
> Results are also highly dependent on the distributions of the
> underlying independent variables (e.g., 60000 datapoints all with
> near-identical distributions may yield phenomenal performance in
> cross-validation yet be almost non-predictive in truly
> unknown/prospective situations).
> Even at 500 datapoints, if the independent-variable distributions look
> similar (with similar endpoints), then when each model is trained on
> 80% of that data, the remaining 20% will almost certainly be easy to
> predict, and repeating that five times will yield statistics that look
> impressive.
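
A toy sketch of that effect (entirely made-up data; the point is only
that a narrow, matched feature distribution inflates cross-validation
scores):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5000 points whose single feature all lies in [0, 1], with a mildly
# non-linear true relationship y = x**2 + noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(5000, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.02, size=5000)

# Every CV fold is drawn from the same narrow distribution, so the
# linear model looks very good here (R^2 around 0.9).
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold CV R^2:", scores.mean())

# "Prospective" data from a region the model never saw: the same model
# now scores very poorly (R^2 goes strongly negative).
X_new = rng.uniform(2, 3, size=(1000, 1))
y_new = X_new[:, 0] ** 2 + rng.normal(scale=0.02, size=1000)
model = LinearRegression().fit(X, y)
print("prospective R^2:", model.score(X_new, y_new))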
>
> So, again, while problem context completely dictates ML experiment
> design, metric selection, and interpretation of outcomes, my personal
> rule of thumb is to do no more than 2-fold cross-validation (50%
> train, 50% predict) once I have 100+ datapoints.
> More extreme still, try using 33% for training and 67% for validation
> (or even 20/80).
> If your model still reports good statistics, then you can be confident
> that the patterns in the training data extrapolate well to those in
> the external validation data.
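
A minimal sketch of that stricter protocol with scikit-learn (made-up
data again; 33/67 could just as well be 20/80, and KFold(n_splits=2)
gives the 50/50 version):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Made-up regression data: 500 samples, 5 features, linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -0.5, 2.0, 0.0, 0.3]) + rng.normal(size=500)

# Train on a third of the data, validate on the remaining two thirds,
# repeated over several random partitions.
splitter = ShuffleSplit(n_splits=5, train_size=0.33, test_size=0.67,
                        random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=splitter)
print("33/67 R^2 per repeat:", np.round(scores, 3))

If these numbers still look good, the patterns learned from the small
training portion really do carry over to the much larger validation
portion.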
>
> Hope this helps,
> J.B.