[scikit-learn] suggested machine learning algorithm

Eric Ma ericmajinglong at gmail.com
Sat Oct 1 14:37:35 EDT 2016


Hi Thomas,

A number of people I've learned from have given me the following "recipe",
which I hold to loosely.

   1. Start with a Random Forest - it should give you a good baseline of
   predictive capacity (a minimal sketch follows this list).
   2. If you only care about predictive value, not interpretability, keep
   tweaking the RF hyperparameters (use grid search + cross-validation), or
   switch to gradient boosting (also shown in the first sketch below).
   3. If you do care about interpretability, use the RF's
   feature_importances_ to pull out the features that matter most for
   prediction. Try a linear regression on just those features; you may also
   want to multiply those features together to add "interaction" terms.
   (This uses the RF as a feature selection method; see the second sketch
   below.)
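
For steps 1 and 2, a minimal sketch (the X/y placeholders and the parameter
grid are only illustrative - substitute your own data):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(42, 4))   # replace with VDWAALS, EEL, EGB, ESURF
    y = rng.normal(size=42)        # replace with Expr

    # Step 1: Random Forest baseline, scored with 5-fold cross-validation.
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    print("RF baseline R^2:", cross_val_score(rf, X, y, cv=5, scoring="r2").mean())

    # Step 2a: tune a few RF hyperparameters with grid search + cross-validation.
    param_grid = {"max_features": [1, 2, 3], "min_samples_leaf": [1, 3, 5]}
    search = GridSearchCV(rf, param_grid, cv=5, scoring="r2").fit(X, y)
    print("best RF params:", search.best_params_)

    # Step 2b: or swap in gradient boosting for comparison.
    gbr = GradientBoostingRegressor(random_state=0)
    print("GBR R^2:", cross_val_score(gbr, X, y, cv=5, scoring="r2").mean())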
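
And for step 3, a hedged sketch of the feature-selection idea (the 0.10
importance cut-off is an arbitrary placeholder, and for a rigorous estimate
the selection step should really be repeated inside the cross-validation
loop):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    X = rng.normal(size=(42, 4))   # replace with VDWAALS, EEL, EGB, ESURF
    y = rng.normal(size=42)        # replace with Expr

    # Fit the forest on everything, keep the most important descriptors.
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
    keep = rf.feature_importances_ > 0.10     # arbitrary illustrative cut-off
    X_sel = X[:, keep]

    # Linear regression on the selected features plus their pairwise products.
    model = make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        LinearRegression(),
    )
    print("linear + interactions R^2:",
          cross_val_score(model, X_sel, y, cv=5, scoring="r2").mean())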

Beyond this, I am sure more "expert" types will be able to chime in, and
also correct me if I've said anything wrong here.

Cheers
Eric

On Sat, Oct 1, 2016 at 10:59 AM, Thomas Evangelidis <tevang3 at gmail.com>
wrote:

> Dear scikit-learn users and developers,
>
> I have a dataset consisting of 42 observations (molnames) and 4 variables
> (VDWAALS, EEL, EGB, ESURF) with which I want to build a predictive model
> that estimates the experimental value (Expr). I tried multivariate linear
> regression with 10,000 bootstrap repeats, each time using 21 observations
> for training and the remaining 21 for testing, but the average correlation
> was only R = 0.1727 +- 0.19779.
>
>
> molname          VDWAALS       EEL        EGB      ESURF      Expr
> CHEMBL108457    -20.4848    -96.5826    23.4584   -5.4045   -7.27193
> CHEMBL388269    -50.3860     28.9403   -51.5147   -6.4061   -6.8022
> CHEMBL244078    -49.1466    -21.9869    17.7999   -6.4588   -6.61742
> CHEMBL244077    -53.4365    -32.8943    34.8723   -7.0384   -6.61742
> CHEMBL396772    -51.4111    -34.4904    36.0326   -6.5443   -5.82207
> ........
>
>
> I would like your advice about which other machine learning algorithms I
> could try with these data. E.g. can I build a decision tree, or are the
> observations and variables too few to avoid overfitting? I could include
> more variables, but the number of observations will always remain 42.
>
> I would greatly appreciate any advice!
>
> Thomas
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>