[scikit-learn] sample_weights in RandomForestRegressor
jbbrown at kuhp.kyoto-u.ac.jp
Mon Jul 16 10:54:55 EDT 2018
Your strategy for model development is built on the assumption that the SAR
(structure-activity relationship) is a continuous manifold constructed for
your compound descriptors.
However, SARs for many proteins in drug discovery or chemical biology are
not continuous (consider kinase inhibitors).
Therefore, you must make an assessment of the training data SAR to check
for the prevalence of activity cliffs.
There are at least two ways you can go about this:
(1) Simply compute all pairwise similarities by your choice of
descriptor+metric, then identify where there are pairs (e.g.,
MACCS-Tanimoto > 0.7) with large activity differences (e.g., K_i or IC50
difference of more than 10/50/100-fold; again, the biology of your problem
determines the right values).
(2) Perform many repetitions of train-test splitting on the 709 reference
molecules, look at the distribution of your evaluation metric, and see if
there is a limit in your ability to predict. If you are hitting a wall in
terms of predictability (metric performance), it's a likely sign there is
an activity cliff, and no amount of machine learning is going to be able to
overcome this. Further, trace the predictability of individual compounds to
identify those that are consistently predicted wrong. If you combine this
with analysis (1), you can know exactly which of your chemistries are
If you find that there are no activity cliffs in your dataset, then your
application of the assumption that chemical similarity implies biological
endpoint similarity will hold, and your experimental design is validated
because of the presence of a continuous manifold.
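As a rough illustration of analysis (1), the pairwise check can be sketched in plain Python. This is only a sketch under stated assumptions: fingerprints are represented as sets of on-bit indices (in practice they would come from a cheminformatics toolkit), activities are potencies in one linear unit, and the function names, the toy data, and the default cutoffs (0.7 similarity, 10-fold) are illustrative choices taken from the examples above, not recommended values.

```python
from itertools import combinations
from math import log10


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(fingerprints, activities, sim_cutoff=0.7, fold_cutoff=10.0):
    """Return (i, j, similarity, fold_difference) for similar pairs with very
    different potency.

    `activities` are potencies (e.g. K_i or IC50) in the same linear unit,
    all strictly positive. Hypothetical helper, not part of any library.
    """
    log_cutoff = log10(fold_cutoff)
    cliffs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        if sim <= sim_cutoff:
            continue  # not similar enough to count as a cliff candidate
        log_diff = abs(log10(activities[i]) - log10(activities[j]))
        if log_diff > log_cutoff:
            cliffs.append((i, j, sim, 10 ** log_diff))
    return cliffs


# Toy data (made up for illustration): on-bit sets and IC50 values in nM.
toy_fps = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 4, 5, 6, 7, 9},  # differs from compound 0 by a single bit
    {20, 21, 22, 23},          # unrelated scaffold
]
toy_ic50_nm = [5.0, 2500.0, 4.0]

cliffs = find_activity_cliffs(toy_fps, toy_ic50_nm)
for i, j, sim, fold in cliffs:
    print(f"compounds {i} and {j}: similarity {sim:.2f}, {fold:.0f}-fold difference")
```

The loop is quadratic in the number of compounds, which is still cheap for a few hundred molecules. The right cutoffs depend on the biology of the target.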
However, if you do have activity cliffs, then as awesome as sklearn is, it
still cannot make the computational chemistry any better.
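For analysis (2), one possible way to run repeated train-test splits with scikit-learn is sketched below. It assumes the descriptors are already in a NumPy array `X` and the activities in `y`. The function name, the synthetic demo data, the choice of R^2 as the metric, and the forest settings are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit


def repeated_split_report(X, y, n_repeats=20, test_size=0.2, seed=0):
    """Score a random forest over many random splits and track per-compound errors.

    Hypothetical helper: returns the array of test-set R^2 scores (one per
    split) and the mean absolute error of each compound over the splits in
    which it was held out (NaN if it was never held out).
    """
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    scores = []
    abs_err_sum = np.zeros(len(y))
    times_tested = np.zeros(len(y), dtype=int)
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(r2_score(y[test_idx], pred))
        abs_err_sum[test_idx] += np.abs(pred - y[test_idx])
        times_tested[test_idx] += 1
    mean_abs_err = np.full(len(y), np.nan)
    tested = times_tested > 0
    mean_abs_err[tested] = abs_err_sum[tested] / times_tested[tested]
    return np.array(scores), mean_abs_err


# Synthetic stand-in for a descriptor matrix and activity vector.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

scores, per_compound_err = repeated_split_report(X_demo, y_demo)
print(f"R^2 over {len(scores)} splits: mean {scores.mean():.2f}, std {scores.std():.2f}")

# Compounds with the largest average held-out error are the ones to inspect.
worst = np.argsort(np.nan_to_num(per_compound_err, nan=-1.0))[::-1][:5]
print("most consistently mispredicted compounds:", worst)
```

If the score distribution plateaus well below what you need, and the same compounds keep appearing in `worst`, those compounds are natural candidates to cross-check against the similar-pair analysis in (1).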
Hope this helps you contextualize your work. Don't hesitate to contact me
if I can be of help.
Kyoto University Graduate School of Medicine
2018-07-16 8:51 GMT+09:00 Thomas Evangelidis <tevang3 at gmail.com>:
> I am kind of confused about the use of the sample_weight parameter in the
> fit() function of RandomForestRegressor. Here is my problem:
> I am trying to predict the binding affinity of small molecules to a
> protein. I have a training set of 709 molecules and a blind test set of 180
> molecules. I want to find those features that are more important for the
> correct prediction of the binding affinity of those 180 molecules of my
> blind test set. My rationale is that if I give more emphasis to the
> similar molecules in the training set, then I will get higher importances
> for those features that have higher predictive ability for this specific
> blind test set of 180 molecules. To this end, I weighted the 709 training
> set molecules by their maximum similarity to the 180 molecules, selected
> only those features with high importance and trained a new RF with all 709
> molecules. I got some results but I am not satisfied. Is this the right way
> to use sample_weight in RF? I would appreciate any advice or suggested
> workflow.
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
> email: tevang at pharm.uoa.gr
> tevang3 at gmail.com
> website: https://sites.google.com/site/thomasevangelidishomepage/