<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>This classic paper on statistical practice (Breiman's "two
cultures") may be helpful for understanding the different
viewpoints:<br>
</p>
<p><a class="moz-txt-link-freetext" href="https://projecteuclid.org/euclid.ss/1009213726">https://projecteuclid.org/euclid.ss/1009213726</a></p>
<p><br>
</p>
<div class="moz-cite-prefix">On 6/3/19 12:19 AM, Brown J.B. via
scikit-learn wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJe_vxDajYG2smoYwjxdOJJHBTEpFYAairsEaBnJaUvHu_E_VQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">As far as I understand: Holding out a test
set is recommended if you aren't entirely sure that the
assumptions of the model are held (gaussian error on a
linear fit; independent and identically distributed
samples). The model evaluation approach in predictive ML,
using held-out data, relies only on the weaker assumption
that the metric you have chosen, when applied to the test
set you have held out, forms a reasonable measure of
generalised / real-world performance. (Of course this too
is often not held in practice, but it is the primary
assumption, in my opinion, that ML practitioners need to
be careful of.)</div>
</blockquote>
<div><br>
</div>
<div>Dear CW, <br>
</div>
<div>As Joel has said, holding out a test set will help you
evaluate the validity of model assumptions, and his last
point (a reasonable measure of generalised performance) is
absolutely essential for understanding the capabilities and
limitations of ML.<br>
</div>
<div><br>
</div>
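<div>For illustration, a minimal sketch of that held-out evaluation in
scikit-learn (the synthetic data, Ridge model, and R^2 metric here are
arbitrary placeholders, not a recommendation):<br>
</div>
<pre>
# Minimal held-out-evaluation sketch; dataset, model and metric are arbitrary.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Hold out a test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge().fit(X_train, y_train)

# The held-out score is only as meaningful as the assumption that this
# test set, scored with this metric, reflects real-world use.
print("held-out R^2:", r2_score(y_test, model.predict(X_test)))
</pre>
<div><br>
</div>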
<div>To add to your checklist for interpreting ML papers
properly: be cautious when interpreting reports of high
performance obtained with 5/10-fold or Leave-One-Out
cross-validation on large datasets, where "large" depends on
the nature of the problem setting.</div>
<div>Results are also highly dependent on the distributions of
the underlying independent variables (e.g., 60000 datapoints
all with near-identical distributions may yield phenomenal
performance in cross-validation and yet be almost
non-predictive in truly unknown/prospective situations).</div>
<div>Even at 500 datapoints, if the independent variable
distributions look similar (with similar endpoints), then
when each model is trained on 80% of that data, the
remaining 20% will very likely be easy to predict, and
repeating that five times will yield statistics that seem
impressive.<br>
</div>
<div><br>
</div>
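<div>As a rough, purely synthetic illustration of that caveat (all
numbers and the choice of model are arbitrary), cross-validation on
narrowly distributed data can look excellent while a
distribution-shifted, prospective-like sample is predicted poorly:<br>
</div>
<pre>
# Synthetic sketch only: high CV scores need not transfer to shifted data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Retrospective data: independent variables drawn from one narrow distribution.
X_cv = rng.normal(loc=0.0, scale=1.0, size=(500, 10))
y_cv = 2.0 * X_cv[:, 0] + rng.normal(scale=0.5, size=500)

# Prospective-like data: the same relationship, but shifted inputs.
X_new = rng.normal(loc=3.0, scale=1.0, size=(200, 10))
y_new = 2.0 * X_new[:, 0] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
print("5-fold CV R^2:   ", cross_val_score(model, X_cv, y_cv, cv=5).mean())
print("prospective R^2: ", model.fit(X_cv, y_cv).score(X_new, y_new))
</pre>
<div><br>
</div>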
<div>So, again, while problem context completely dictates ML
experiment design, metric selection, and interpretation of
outcome, my personal rule of thumb is to do no more than
2-fold cross-validation (50% train, 50% predict) once I have
100+ datapoints.</div>
<div>More extreme still, try using 33% for training and 67% for
validation (or even 20/80).<br>
</div>
<div>If your model still reports good statistics, then you can
be more confident that the patterns in the training data
extrapolate well to those in the external validation data.</div>
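<div><br>
</div>
<div>A quick sketch of exercising those harsher splits (this uses
repeated random splits rather than strict 2-fold CV, and the fractions,
data, and model are only illustrative):<br>
</div>
<pre>
# Sketch of harsher train/test splits; all choices here are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

for train_frac in (0.5, 0.33, 0.2):
    # e.g. train_frac=0.5 corresponds to the 50% train / 50% predict setting.
    cv = ShuffleSplit(n_splits=5, train_size=train_frac, random_state=0)
    scores = cross_val_score(Ridge(), X, y, cv=cv)
    print("train fraction", train_frac, "mean R^2", round(scores.mean(), 3))
</pre>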
<div><br>
</div>
<div>Hope this helps,</div>
<div>J.B.<br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div class="gmail_quote"><br>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
scikit-learn mailing list
<a class="moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>
<a class="moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn">https://mail.python.org/mailman/listinfo/scikit-learn</a>
</pre>
</blockquote>
</body>
</html>