suggested machine learning algorithm
Dear scikit-learn users and developers,

I have a dataset consisting of 42 observations (molnames) and 4 variables (VDWAALS, EEL, EGB, ESURF) with which I want to make a predictive model that estimates the experimental value (Expr). I tried multivariate linear regression using 10,000 bootstrap repeats, each time using 21 observations for training and the remaining 21 for testing, but the average correlation was only R = 0.1727 +- 0.19779.

molname        VDWAALS   EEL       EGB       ESURF    Expr
CHEMBL108457   -20.4848  -96.5826  23.4584   -5.4045  -7.27193
CHEMBL388269   -50.3860  28.9403   -51.5147  -6.4061  -6.8022
CHEMBL244078   -49.1466  -21.9869  17.7999   -6.4588  -6.61742
CHEMBL244077   -53.4365  -32.8943  34.8723   -7.0384  -6.61742
CHEMBL396772   -51.4111  -34.4904  36.0326   -6.5443  -5.82207
........
I would like your advice about what other machine learning algorithms I could try with these data. For example, can I build a decision tree, or are the observations and variables too few to avoid overfitting? I could include more variables, but the number of observations will always remain 42. I would greatly appreciate any advice! Thomas
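For reference, the repeated 50/50 evaluation described above might be reproduced roughly as follows. This is a sketch on synthetic stand-in data (the real 42x4 table is not reproduced here), and it reads the post's "bootstrap repeats" as repeated random 21/21 splits; fewer repeats than the original 10,000 are used for speed.

```python
import numpy as np
from numpy.random import default_rng
from sklearn.linear_model import LinearRegression

rng = default_rng(0)
# Synthetic stand-in for the 42x4 table (VDWAALS, EEL, EGB, ESURF) and Expr.
X = rng.normal(size=(42, 4))
y = rng.normal(size=42)

r_values = []
for _ in range(2_000):                 # the post used 10,000 repeats
    idx = rng.permutation(42)          # random 21/21 train/test split
    train, test = idx[:21], idx[21:]
    model = LinearRegression().fit(X[train], y[train])
    pred = model.predict(X[test])
    # Pearson correlation between predictions and held-out targets
    r_values.append(np.corrcoef(pred, y[test])[0, 1])

print(f"R = {np.mean(r_values):.4f} +- {np.std(r_values):.4f}")
```

With only 21 training points and pure-noise targets, the average R hovers near zero, which is essentially what the post reports on the real data.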
Hi Thomas,

A number of people I've learned from have given me the following "recipe", which I hold to loosely:

1. Start with Random Forest - it should give you a good baseline of predictive capacity.
2. If you don't care about interpretability but only about predictive value, keep tweaking the RF parameters (use grid search + cross-validation), or switch to gradient boosting.
3. If you do care about interpretability, use RF's feature_importances_ to extract the features that are important for prediction, then try linear regression on just those. You may also want to multiply those features together to model their "interaction" product. (This uses RF as a feature-selection method.)

Beyond this, I am sure more "expert" types will be able to chime in, and also correct me if I've said anything wrong here.

Cheers,
Eric

On Sat, Oct 1, 2016 at 10:59 AM, Thomas Evangelidis <tevang3@gmail.com> wrote:
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
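Eric's three-step recipe might be sketched in scikit-learn as below. This is a hypothetical illustration on synthetic stand-in data; the grid values and the choice to keep the top two features are arbitrary assumptions, not anything from the thread.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 4))   # stand-in for VDWAALS, EEL, EGB, ESURF
y = rng.normal(size=42)        # stand-in for Expr

# Steps 1-2: RF baseline, tuned with grid search + cross-validation.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 4, None]},
    cv=5,
)
grid.fit(X, y)
rf = grid.best_estimator_

# Step 3: use feature_importances_ for selection, then refit a linear model
# on only the most important features.
order = np.argsort(rf.feature_importances_)[::-1]
top = order[:2]                # keep the two most important features (arbitrary)
lin_scores = cross_val_score(LinearRegression(), X[:, top], y, cv=5)
print("importances:", rf.feature_importances_, "CV R^2:", lin_scores.mean())
```

The "interaction" product Eric mentions would just be an extra column such as `X[:, top].prod(axis=1)` appended before the linear fit.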
Hi Thomas,

What quality do you get on training?

There is no silver bullet, but there is a quite common technique you can use to find out whether you are using an appropriate algorithm. Take a look at the difference between the "train" and "validation" quality on learning curves (example <http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_c...>). If you see a big gap, you can reduce the complexity of your model to overcome overfitting (reduce the interaction parameter / number of variables / iterations / ...). If you see a small gap, you can try to increase model complexity to fit your data better.

Moreover, I see you have a tiny dataset and use a 50/50 split. I presume that you will train the "production" model on the whole available dataset. In that case, I suggest you use more data for training and an almost-LOO <http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-l...> approach to better estimate your predictive quality. But be really cautious with cross-validation, as you can easily overfit your data.

2016-10-01 15:59 GMT+01:00 Thomas Evangelidis <tevang3@gmail.com>:
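The learning-curve diagnostic and the near-LOO evaluation described above might look like this in scikit-learn. Synthetic data stands in for the real table, and a deliberately linear target is assumed so the curves have something to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve, LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.5, size=42)  # assumed linear signal

# Train vs. validation score as the training set grows: a large gap
# suggests overfitting, a small gap suggests the model could be richer.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.3, 1.0, 5), cv=5
)
print("train:", train_scores.mean(axis=1), "validation:", val_scores.mean(axis=1))

# Near-LOO evaluation: train on 41 observations, predict the held-out one.
pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
print("LOO R:", np.corrcoef(pred, y)[0, 1])
```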
--
Yours sincerely,
Alexey A. Dral
https://www.linkedin.com/in/alexey-dral
Hi Алексей,

Are the "Training examples" in the learning curves the number of observations used for training? Don't you think my dataset is rather small (42 observations) to use that technique?
Yes, it is really a tiny dataset =). You don't necessarily need to plot it over the number of training observations; for instance, you can make the same plot over the number of iterations.
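Plotting train versus validation quality over iterations, as suggested, can be sketched with gradient boosting's staged predictions. This is an illustrative sketch on synthetic data; the 200 iterations and the 75/25 split are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.5, size=42)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
gb = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Train/validation error after each boosting iteration: the point where
# validation error stops improving marks the useful model complexity.
train_err = [np.mean((y_tr - p) ** 2) for p in gb.staged_predict(X_tr)]
val_err = [np.mean((y_val - p) ** 2) for p in gb.staged_predict(X_val)]
print("best iteration:", int(np.argmin(val_err)) + 1)
```

A widening gap between `train_err` and `val_err` as iterations grow is the same overfitting signal as the gap in a learning curve over training-set size.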
--
Yours sincerely,
Alexey A. Dral
https://www.linkedin.com/in/alexey-dral
Maybe it's worth switching to LOOCV, since you may have a bit of a pessimistic bias here due to the small training set size (a bootstrap sample asymptotically contains only ~63.2% unique samples for training). I would try both linear and nonlinear models. Instead of adding more features, maybe also try to eliminate some via L1 regularization, feature selection, or feature extraction, in addition to trying different algorithms such as random forests, Gaussian processes, RBF-kernel SVM regression, and so forth.
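One way these suggestions could be combined is a LOOCV sweep over a few linear and nonlinear regressors, scored by the same Pearson R the original post uses. This is a sketch on synthetic stand-in data; the particular model list and default hyperparameters are assumptions, not a recommendation from the thread.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.5, size=42)

models = {
    "linear": LinearRegression(),
    "lasso (L1)": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "random forest": RandomForestRegressor(random_state=0),
    "RBF SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "gaussian process": GaussianProcessRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # LOOCV: each of the 42 observations is predicted by a model
    # trained on the other 41.
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    results[name] = float(np.corrcoef(pred, y)[0, 1])
print(results)
```

Scaling inside a pipeline matters for the L1 and SVR models, so that the per-fold scaler is fit only on the training portion of each LOO split.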
participants (4)
- Eric Ma
- Sebastian Raschka
- Thomas Evangelidis
- Алексей Драль