[scikit-learn] suggested machine learning algorithm

Алексей Драль aadral at gmail.com
Sat Oct 1 14:48:44 EDT 2016


Hi Thomas,

What quality do you get on the training set?

There is no silver bullet, but there is a fairly common technique for
checking whether your algorithm is appropriate: look at the gap between
the "train" and "validation" scores on the learning curves (
example
<http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#example-model-selection-plot-learning-curve-py>).
If you see a big gap, the model is overfitting, so you can reduce its
complexity (reduce the interaction parameter / number of variables /
iterations / ...). If you see a small gap but both scores are poor, the
model is underfitting, so you can try increasing its complexity to fit
your data better.
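
To make the train/validation comparison concrete, here is a minimal sketch
using learning_curve from sklearn.model_selection. Everything specific in it
is an assumption for illustration: the data are synthetic stand-ins with the
same shape as yours (42 observations, 4 variables), and the Ridge model, the
ShuffleSplit settings and the MSE scoring are arbitrary choices, not a
recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for the real data (assumption): 42 observations, 4 variables.
rng = np.random.RandomState(0)
X = rng.normal(size=(42, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.5, size=42)

# Many random splits, because a single split of 42 points is very noisy.
cv = ShuffleSplit(n_splits=50, test_size=0.25, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    Ridge(alpha=1.0),  # hypothetical model choice
    X,
    y,
    train_sizes=np.linspace(0.3, 1.0, 5),
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over the splits.
train_mse = -train_scores.mean(axis=1)
valid_mse = -valid_scores.mean(axis=1)
gap = valid_mse - train_mse

for n, tr, va, g in zip(train_sizes, train_mse, valid_mse, gap):
    print("n_train=%2d  train MSE=%.3f  validation MSE=%.3f  gap=%.3f" % (n, tr, va, g))
```

A gap that stays large as n_train grows points to overfitting; a small gap
with a high error on both curves points to underfitting.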

Moreover, I see you have a tiny dataset and use a 50/50 split. I presume
that you will train the "production" model on the whole available dataset.
In that case, I suggest using more of the data for training and something
close to the leave-one-out (LOO)
<http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo>
approach
to better estimate your predictive quality. But be really cautious with
cross-validation: if you use its scores repeatedly to choose between
models, you can easily overfit to your data.
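
As a rough illustration of the LOO idea, the sketch below fits a model 42
times, each time holding out one observation, and then compares the held-out
predictions with the true values. The data are again synthetic stand-ins
(an assumption), and LinearRegression is used only because it matches the
linear model in the original question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the real data (assumption): 42 observations, 4 variables.
rng = np.random.RandomState(0)
X = rng.normal(size=(42, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.5, size=42)

# Each of the 42 splits trains on 41 observations and predicts the remaining one.
loo = LeaveOneOut()
y_pred = cross_val_predict(LinearRegression(), X, y, cv=loo)

# Summarise the out-of-sample predictions over all held-out points at once.
r = np.corrcoef(y, y_pred)[0, 1]
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print("LOO correlation R = %.3f, RMSE = %.3f" % (r, rmse))
```

Note that the correlation is computed once over all 42 held-out predictions,
since a per-split correlation is undefined when each test set has one point.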


2016-10-01 15:59 GMT+01:00 Thomas Evangelidis <tevang3 at gmail.com>:

> Dear scikit-learn users and developers,
>
> I have a dataset consisting of 42 observations (molnames) and 4 variables (
> VDWAALS, EEL, EGB, ESURF) with which I want to make a predictive model
> that estimates the experimental value (Expr). I tried multivariate linear
> regression using 10,000 bootstrap repeats, each time using 21 observations
> for training and the remaining 21 for testing, but the average correlation
> was only R = 0.1727 +- 0.19779.
>
>
> molname         VDWAALS     EEL         EGB         ESURF      Expr
> CHEMBL108457    -20.4848    -96.5826     23.4584    -5.4045    -7.27193
> CHEMBL388269    -50.3860     28.9403    -51.5147    -6.4061    -6.8022
> CHEMBL244078    -49.1466    -21.9869     17.7999    -6.4588    -6.61742
> CHEMBL244077    -53.4365    -32.8943     34.8723    -7.0384    -6.61742
> CHEMBL396772    -51.4111    -34.4904     36.0326    -6.5443    -5.82207
> ........
>
>
> I would like your advice about what other machine learning algorithms I
> could try with these data. E.g. can I make a decision tree, or are the
> observations and variables too few to avoid overfitting? I could
> include more variables but the observations will always remain 42.
>
> I would greatly appreciate any advice!
>
> Thomas


-- 
Yours sincerely,
https://www.linkedin.com/in/alexey-dral
Alexey A. Dral