[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?
Andreas Mueller
t3kcit at gmail.com
Mon Jun 3 11:41:17 EDT 2019
This classical paper on statistical practices (Breiman's "two cultures")
might be helpful to understand the different viewpoints:
https://projecteuclid.org/euclid.ss/1009213726
On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>
> As far as I understand: Holding out a test set is recommended if
> you aren't entirely sure that the assumptions of the model hold
> (Gaussian error on a linear fit; independent and identically
> distributed samples). The model evaluation approach in predictive
> ML, using held-out data, relies only on the weaker assumption that
> the metric you have chosen, when applied to the test set you have
> held out, forms a reasonable measure of generalised / real-world
> performance. (Of course this assumption is often violated in
> practice too, but it is, in my opinion, the primary assumption that
> ML practitioners need to be careful about.)
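
For concreteness, a rough sketch of that held-out evaluation for an ordinary
linear regression (synthetic data and illustrative names, not a prescription):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Synthetic data, only to make the sketch runnable.
    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

    # Hold out a test set; the model never sees it while fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    print("held-out R^2:", r2_score(y_test, pred))
    print("held-out MSE:", mean_squared_error(y_test, pred))

The chosen metric (here R^2 / MSE) is exactly the "reasonable measure" that
the argument above leans on.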
>
>
> Dear CW,
> As Joel has said, holding out a test set will help you evaluate the
> validity of model assumptions, and his last point (reasonable measure
> of generalised performance) is absolutely essential for understanding
> the capabilities and limitations of ML.
>
> To add to your checklist of interpreting ML papers properly, be
> cautious when interpreting reports of high performance when using
> 5/10-fold or Leave-One-Out cross-validation on large datasets, where
> "large" depends on the nature of the problem setting.
> Results also depend strongly on the distributions of the
> underlying independent variables (e.g., 60,000 datapoints that all
> have near-identical distributions may yield phenomenal performance
> in cross-validation and yet be almost non-predictive in truly
> unknown/prospective situations).
> Even at 500 datapoints, if independent variable distributions look
> similar (with similar endpoints), then when each model is trained on
> 80% of that data, the remaining 20% will certainly be predictable, and
> repeating that five times will yield statistics that seem impressive.
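
(For reference, that 80%/20%-repeated-five-times procedure is just 5-fold
cross-validation; a runnable sketch with synthetic, illustrative data is
below. The caveat above is that these per-fold scores can look far better
than truly prospective performance when the folds share near-identical
distributions.)

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    # Synthetic stand-in for the 500-datapoint example above.
    rng = np.random.RandomState(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

    # Each of the 5 folds trains on 80% of the data and scores on the
    # remaining 20%.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
    print("per-fold R^2:", scores)
    print("mean R^2:", scores.mean())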
>
> So, again, while problem context completely dictates ML experiment
> design, metric selection, and interpretation of outcomes, my
> personal rule of thumb is to do no more than 2-fold cross-validation
> (50% train, 50% predict) when I have 100+ datapoints.
> To be even more stringent, try using 33% for training and 67% for
> validation (or even 20/80).
> If your model still reports good statistics, then you can believe
> that the patterns in the training data extrapolate well to those in
> the external validation data.
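
A minimal sketch of that kind of stringent check (train on a minority of the
data, score on the larger remainder; data and names are again illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.normal(size=(300, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

    # Train on 33% and evaluate on the 67% held out
    # (train_size=0.2 gives the even harsher 20/80 split).
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.33, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on the 67% held out:", model.score(X_val, y_val))

If such a split still yields good statistics, that is much stronger evidence
of extrapolation than an 80/20 split repeated five times.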
>
> Hope this helps,
> J.B.
>
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn