[scikit-learn] suggested machine learning algorithm
se.raschka at gmail.com
Sat Oct 1 15:58:39 EDT 2016
Maybe it’s worth switching to LOOCV, since you may have a bit of a pessimistic bias here due to the small training set size (in the bootstrap, asymptotically only about 63.2% of the samples end up as unique training samples). I would try both linear and nonlinear models. Instead of adding more features, maybe also try to eliminate some via L1 regularization, feature selection, or feature extraction, in addition to trying different algorithms such as random forests, Gaussian processes, RBF-kernel SVM regression, and so forth.
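A rough sketch of what that could look like in scikit-learn is below. The data is a synthetic stand-in with the same shape as yours (42 x 4), not your actual table. The coefficients, noise level, and all hyperparameters are untuned placeholders I made up for illustration. The helper names expected_unique_fraction and loocv_pearson are also just mine.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def expected_unique_fraction(n):
    """Expected fraction of distinct samples in a bootstrap draw of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def loocv_pearson(model, X, y):
    """Pearson r between y and the pooled leave-one-out predictions."""
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return float(np.corrcoef(y, pred)[0, 1])


# Synthetic placeholder data (assumption): 42 observations, 4 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(42, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.0]) + rng.normal(scale=1.0, size=42)

# Candidate models; scaling sits inside the pipeline so it is refit per fold.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "lasso (L1)": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
    "gaussian process": make_pipeline(
        StandardScaler(),
        GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                 normalize_y=True, random_state=0),
    ),
}

print("bootstrap unique fraction, n=42: %.3f" % expected_unique_fraction(42))
results = {name: loocv_pearson(m, X, y) for name, m in models.items()}
for name, r in results.items():
    print("%-18s LOOCV r = %.3f" % (name, r))
```

With LOOCV each model is trained on 41 of the 42 observations rather than 21, which is where the reduction in pessimistic bias would come from. One caveat: correlating pooled leave-one-out predictions can be misleading for models that predict close to the training mean, so it may be worth also looking at the mean absolute or squared error.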
> On Oct 1, 2016, at 10:59 AM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> Dear scikit-learn users and developers,
> I have a dataset consisting of 42 observations (molnames) and 4 variables (VDWAALS, EEL, EGB, ESURF), with which I want to build a predictive model that estimates the experimental value (Expr). I tried multivariate linear regression with 10,000 bootstrap repeats, each time using 21 observations for training and the remaining 21 for testing, but the average correlation was only R = 0.1727 +- 0.19779.
> molname       VDWAALS   EEL       EGB       ESURF    Expr
> CHEMBL108457  -20.4848  -96.5826   23.4584  -5.4045  -7.27193
> CHEMBL388269  -50.3860   28.9403  -51.5147  -6.4061  -6.8022
> CHEMBL244078  -49.1466  -21.9869   17.7999  -6.4588  -6.61742
> CHEMBL244077  -53.4365  -32.8943   34.8723  -7.0384  -6.61742
> CHEMBL396772  -51.4111  -34.4904   36.0326  -6.5443  -5.82207
> I would like your advice about what other machine learning algorithms I could try with these data. E.g., can I build a decision tree, or are the observations and variables too few to avoid overfitting? I could include more variables, but the number of observations will always remain 42.
> I would greatly appreciate any advice!