Hi All,

I am trying to use random forests for a regression problem with 10 input variables and one output variable. I am getting a very good fit even with default parameters and a low n_estimators. Even with n_estimators = 10, I get an R^2 value of 0.95 on the testing dataset (MSE = 23) and 0.99 on the training set. I was wondering whether this is common with random forests or whether I am missing something. Could you please share your experience? The total number of samples (training + testing) is 10971.

Also, what are the most important parameters (max_depth, bootstrap, max_leaf_nodes, etc.) that I need to tune to improve my model further?

Lastly, is there a way I can visualise a single tree of my forest (just for demonstration purposes)?

Please see the figure below, which shows how well the model fits with default values.

[image: Inline image 1]

Thanks,
Kindest Regards,
Waseem
Hi Muhammad,

If you've not yet read the documentation, I would highly recommend starting with the Decision Tree [1] and working your way through the examples on your own data. You'll find an example [2] of how to generate a graphviz-compatible dot file and visualise it.

Once you're satisfied that you understand what each tree is doing with your dataset as you vary parameters, it makes sense to try to inject some randomness by varying the features used in each tree, or the samples (or both [3]).

Regards,
Brian

[1] http://scikit-learn.org/stable/modules/tree.html
[2] http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphvi...
[3] http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTrees...

In reply to muhammad waseem <m.waseem.ahmad@gmail.com>, 23 June 2016 at 10:20.
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
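The suggestion to start from a single decision tree and vary its parameters could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_regression (not the poster's dataset), the depths tried are arbitrary, and it assumes a scikit-learn version that provides sklearn.model_selection and lets export_graphviz return a string when out_file=None.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Assumption: synthetic stand-in for the real data (10 inputs, one output).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Watch how train and test R^2 move as the tree is allowed to grow deeper.
# max_depth=None lets the tree grow until the leaves are pure.
scores = {}
for depth in (2, 4, 8, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])

# Export a shallow tree so the resulting picture stays readable.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)
dot_source = export_graphviz(small_tree, out_file=None)
```

The dot_source string can be saved to a .dot file and rendered with Graphviz, for example with `dot -Tpng tree.dot -o tree.png`. A large gap between the training and test scores at a given depth is a hint that the tree is memorising the training data.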
Hi Brian,

Thanks for your email. I did try tree.export_graphviz(model, out_file='tree.dot'), but I got an error saying:

AttributeError: 'RandomForestRegressor' object has no attribute 'tree_'

I think this is because the model is a forest, not a single tree, which is why I can't visualise it. Is that right?

Also, do you have any comments on the results that I got with default values?

Regards,
Waseem

In reply to Brian Holt <bdholt1@gmail.com>, Thu, Jun 23, 2016 at 11:05 AM.
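The AttributeError above is consistent with that reading: a fitted RandomForestRegressor has no tree_ attribute of its own, but it keeps its fitted trees in the estimators_ list, and each of those can be passed to export_graphviz. A minimal sketch, assuming synthetic data from make_regression in place of the real dataset, an arbitrary max_depth=3 chosen only to keep the drawing small, and a scikit-learn version in which export_graphviz returns a string when out_file=None:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Assumption: synthetic stand-in for the real data (10 inputs, one output).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# max_depth=3 is only here to keep the exported tree small enough to read.
model = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
model.fit(X, y)

# estimators_ holds the individual fitted DecisionTreeRegressor objects.
first_tree = model.estimators_[0]

# Each individual tree does have a tree_ attribute, so it can be exported.
dot_source = export_graphviz(first_tree, out_file=None)
print(dot_source[:200])
```

Passing out_file='tree.dot' instead, as in the original call, writes the same content to a file that Graphviz can render.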
It is probably a good idea to start by separating off part of your training data into a held-out development set that is not used for training, which you can use to create learning curves and estimate probable performance on unseen data. I really recommend Andrew Ng's machine learning course material from Stanford and Coursera. It shows you how to use learning curves to understand your problem and also the way that different estimators behave.

There are many estimators that will achieve an extremely good fit to typical training data, but the differences between estimators show up mostly in what happens with unseen test data. Personally, I always start by seeing how well simple classifiers or regressors do (Naive Bayes, linear regression, etc.), then try regularized linear models like ElasticNet, then SVMs, then random forests or other ensemble models. That way, I end up using the powerful and complex models only when the data demands it.

In reply to muhammad waseem <m.waseem.ahmad@gmail.com>, 23 June 2016 at 10:20.
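A learning curve of the kind described above can be produced with scikit-learn's learning_curve helper. The sketch below is an assumption-laden illustration rather than the exact procedure from the course material: it uses synthetic data from make_regression, and it swaps the single held-out development set for 5-fold cross-validation, which plays the same role of scoring on data not used for fitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Assumption: synthetic stand-in for the real data (10 inputs, one output).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Fit on growing subsets of the training folds and score on the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=10, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="r2",
)

# Average over the folds: one training and one validation score per size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"{int(n):5d} samples: train R^2 = {tr:.3f}, validation R^2 = {va:.3f}")
```

If the validation score is still climbing at the largest training size, more data is likely to help; if the training score stays far above a flat validation score, the model is probably overfitting.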
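The simple-to-complex progression described in that reply could be tried along these lines. Everything specific here is an assumption made for illustration: the synthetic data, the hyperparameter values (alpha=0.1, C=10.0), the use of StandardScaler before the scale-sensitive models, and 5-fold cross-validated R^2 as the yardstick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: synthetic stand-in for the real data (10 inputs, one output).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Ordered from simplest to most flexible; hyperparameters are illustrative.
candidates = {
    "linear regression": LinearRegression(),
    "elastic net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1)),
    "SVR (RBF kernel)": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "random forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Score every candidate the same way so the numbers are comparable.
results = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    results[name] = fold_scores.mean()
    print(f"{name:20s} mean R^2 = {results[name]:.3f}")
```

If a simple model already scores close to the random forest on held-out folds, the extra complexity may not be buying much on that dataset.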
Thanks, Chris. I will look into your recommendations. I have also tried an artificial neural network, and it gave me good results on the test set as well.

Regards,
Waseem

In reply to chris brew <cbrew@acm.org>, Thu, Jun 23, 2016 at 12:00 PM.
participants (3)
- Brian Holt
- chris brew
- muhammad waseem