Any way to tune the parameters better than GridSearchCV?
Take random forest as an example: if I give n_estimators values from 10 to 10000 (10, 100, 1000, 10000) to grid search, I find from the results that n_estimators=100 is the best, but I don't know whether something lower or greater than 100 would be even better. How should I decide? Brute force, or are there any tools better than GridSearchCV? Thanks
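For reference, a minimal sketch of the grid search described above, assuming scikit-learn's GridSearchCV and RandomForestClassifier; the synthetic dataset and cv=5 are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic data; replace with your own dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 100, 1000, 10000]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # e.g. {'n_estimators': 100}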
A simple but nonetheless practical solution is to:

(1) start with an upper bound on the number of trees you are willing to accept in the model,
(2) obtain its performance (ACC, MCC, F1, etc.) as the starting reference point,
(3) systematically lower the number of trees (log2 scale down, fixed-size decrement, etc.),
(4) obtain the performance of the reduced forest size,
(5) repeat (3)-(4) until [performance(reference) - performance(current forest size)] > tolerance.

You can encapsulate that in a function which then returns the final model you obtain (a sketch follows below).
From the model object, the number of trees can be obtained.
J.B.
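For concreteness, a minimal sketch of that shrinking procedure, assuming scikit-learn's RandomForestClassifier with accuracy as the metric; the tolerance value, the log2 halving schedule, and the shrink_forest helper name are illustrative assumptions, not part of the original suggestion:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def shrink_forest(X_train, y_train, X_val, y_val,
                  upper_bound=1000, tolerance=0.005, random_state=0):
    """Halve n_estimators until validation performance drops more
    than `tolerance` below the upper-bound reference."""
    def fit_and_score(n_trees):
        model = RandomForestClassifier(n_estimators=n_trees,
                                       random_state=random_state)
        model.fit(X_train, y_train)
        return model, accuracy_score(y_val, model.predict(X_val))

    best_model, reference = fit_and_score(upper_bound)  # steps (1)-(2)
    n_trees = upper_bound
    while n_trees > 1:
        candidate = n_trees // 2                 # step (3): log2 scale down
        model, score = fit_and_score(candidate)  # step (4)
        if reference - score > tolerance:        # step (5): stop when too lossy
            break
        best_model, n_trees = model, candidate
    return best_model

# Illustrative usage on synthetic data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
final = shrink_forest(X_tr, y_tr, X_val, y_val)
print("final number of trees:", final.n_estimators)  # read off the model object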
I would like to make a related suggestion, but instead of focusing on an upper bound for the number of trees, I would focus on choosing the lower bound. From a theoretical perspective, it doesn't make sense to me how fewer trees can result in a better-performing random forest model in terms of generalization performance. If you observe better performance on the same independent test set with fewer trees, I would say that this is likely not a good indicator of better generalization performance; it could be due to overfitting, the train/test set resampling, and/or picking up artifacts in the dataset.

As a general suggestion, I would choose a reasonable number of trees that seems computationally feasible given the size of the dataset and the number of hyperparameters to compare via model selection. Then, after tuning, I would use the best hyperparameter setting with 10x more trees and see if you notice any significant difference in the cross-validation performance. Next, I would fit the model to the whole training set with those best hyperparameters and evaluate the performance on the independent test set (a sketch of this workflow follows below).

Best,
Sebastian
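A minimal sketch of that workflow, assuming scikit-learn's RandomForestClassifier; the parameter grid, the 100 -> 1000 tree counts, and the synthetic dataset are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Tune the other hyperparameters with a computationally feasible
#    number of trees.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [None, 10], "max_features": ["sqrt", 0.5]},
    cv=5,
)
search.fit(X_train, y_train)

# 2) Re-check the best setting with 10x more trees; the CV score
#    should not differ significantly.
big = RandomForestClassifier(n_estimators=1000, random_state=0,
                             **search.best_params_)
cv_big = cross_val_score(big, X_train, y_train, cv=5).mean()
print("CV, 100 trees:", search.best_score_, "| CV, 1000 trees:", cv_big)

# 3) Fit on the whole training set with the best hyperparameters and
#    evaluate on the independent test set.
big.fit(X_train, y_train)
print("test accuracy:", big.score(X_test, y_test))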