I didn't express myself well, but what I meant was: model selection via k-fold on the training set, i.e., using the k-fold splits as the training/validation sets :D

On 27 January 2017 at 00:37, Sebastian Raschka <se.raschka@gmail.com> wrote:
> Furthermore, a training, validation, and testing set should be used when setting up parameters.
Usually, it's better to use a training set and a separate test set, and do model selection via k-fold on the training set. Then, you do the final model estimation on the test set that you haven't touched before. I often use the "training, validation, and testing" approach as well, though, especially when working with large datasets and for early stopping on neural nets.
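For example, a minimal sketch of that workflow in scikit-learn (the SVC and the C values below are just placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model selection: compare candidates via k-fold CV on the training set only.
for C in [0.1, 1.0, 10.0]:
    scores = cross_val_score(SVC(C=C), X_train, y_train, cv=5)
    print(C, scores.mean())

# Final estimate: refit the chosen model on the whole training set and
# score it once on the untouched test set.
final_model = SVC(C=1.0).fit(X_train, y_train)
print(final_model.score(X_test, y_test))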
Best, Sebastian
On Jan 26, 2017, at 1:19 PM, Raga Markely <raga.markely@gmail.com> wrote:
Thank you, Guillaume.
1. I agree with you - that's what I have been learning, and it makes sense. I was a bit surprised when I read the paper today.
2. Ah, thank you. I've got to change my glasses :P
Best, Raga
On 26 January 2017 at 12:05, Guillaume Lemaître <g.lemaitre58@gmail.com> wrote:
1. You should not evaluate an estimator on the data that were used to train it. Usually, you try to minimize the classification error or loss on those data and fit them as well as possible. Evaluating on an unseen testing set will give you an idea of how well your estimator was able to generalize to your problem during training. Furthermore, a training, validation, and testing set should be used when setting up parameters: the validation set is used to set the parameters, and the testing set is used to evaluate your best estimator.
That is why, when using GridSearchCV, fit will train the estimator using training and validation splits (following a given CV strategy). Finally, predict will be performed on another unseen testing set.
The bottom line is that using training data to select parameters will not ensure that you are selecting the best parameters for your problem.
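Concretely, a minimal sketch of that workflow (the parameter grid here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# fit() splits X_train internally into training/validation folds
# (a 5-fold CV strategy here) to choose the best parameters.
grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# The refitted best estimator is then evaluated on the unseen test set.
print(grid.best_params_)
print(grid.score(X_test, y_test))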
2. The function is called in _fit_and_score, at lines 260 and 263, for instance.
On 26 January 2017 at 17:02, Raga Markely <raga.markely@gmail.com> wrote:
Hello,
I have 2 questions regarding cross_val_score.
1. Do the scores returned by cross_val_score correspond to only the test set or to the whole data set (training and test sets)?

I tried to look at the source code, and it looks like it returns the score of only the test set (line 145: "return_train_score=False") - I am not sure if I am reading the code properly, though:
sklearn/model_selection/_validation.py#L36
I came across the paper below, and the authors use the score of the whole dataset when they perform repeated nested loops, grid search CV, etc. - see, e.g., Algorithm 1 (line 1c) and Algorithm 2 (line 2d) on page 3.
https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10

I wonder what the pros and cons are of using the accuracy score of the whole dataset vs. just the test set. Any thoughts?
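(For reference, newer scikit-learn versions also offer cross_validate, which can report the training-fold scores alongside the test-fold scores; cross_val_score itself only returns the latter:)

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# return_train_score=True exposes the training-fold scores as well;
# cross_val_score would return only the equivalent of 'test_score'.
res = cross_validate(SVC(), X, y, cv=5, return_train_score=True)
print(res['train_score'])  # scores on the training folds
print(res['test_score'])   # scores on the held-out folds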
2. On line 283 of the cross_val_score source code, there is a function _score. However, I can't find where this function is called. Could you let me know where it is called?
Thank you very much!
Raga
--
Guillaume Lemaitre
INRIA Saclay - Ile-de-France, Equipe PARIETAL
guillaume.lemaitre@inria.fr
https://glemaitre.github.io/