[scikit-learn] combining datasets from different sources

Thomas Evangelidis tevang3 at gmail.com
Thu Sep 7 09:57:01 EDT 2017


On 7 September 2017 at 15:29, Maciek Wójcikowski <maciek at wojcikowski.pl>
wrote:

> I think StandardScaler is what you want. For each assay you will get a mean
> and variance. The average mean would be the "optimal" shift and the average
> variance the spread. But would this value make any physical sense?
>
I think you missed my point. The problem was scaling with restraints: the
RMSD between the binding affinities of the common ligands must be minimized
upon scaling. Anyway, I managed to work it out using scipy.optimize.
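
A minimal sketch of that kind of fit (the ligand values and variable names
are made up; assay A is held fixed as the reference and assay B is rescaled
onto it):

    import numpy as np
    from scipy.optimize import minimize

    # hypothetical affinities of the ligands common to assay A and assay B
    y_A = np.array([-7.0, -10.0, -8.5])    # common ligands measured in assay A
    y_B = np.array([-9.0, -12.0, -10.4])   # the same ligands measured in assay B

    def rmsd(params):
        """RMSD between assay A and the linearly rescaled assay B values."""
        scale, shift = params
        return np.sqrt(np.mean((y_A - (scale * y_B + shift)) ** 2))

    # start from the identity transform (scale=1, shift=0)
    result = minimize(rmsd, x0=[1.0, 0.0], method="Nelder-Mead")
    scale, shift = result.x

    # the fitted transform can then be applied to ALL of assay B,
    # not just the common ligands
    y_B_rescaled = scale * y_B + shift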




> Regarding RF-Score-VS: in fact it is a regressor and it predicts a
> real value, not a class. Although it is validated mostly using the
> Enrichment Factor, the last figure shows top results for regression vs. Vina.
>
To my understanding, you trained the RF using class information (active,
inactive) and the prediction was a probability value. If the probability
is above 0.5 the compound is classified as active, otherwise as inactive.
This is how sklearn.ensemble.RandomForestClassifier works.

In contrast, I train MLPRegressors using binding affinities (scalar values)
and the predictions are also binding affinities (scalar values).
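
The contrast in scikit-learn terms, as a toy sketch (random arrays stand in
for the molecular descriptors and affinities):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(100, 10)                  # stand-in molecular descriptors
    y_class = rng.randint(0, 2, 100)       # active (1) / inactive (0) labels
    y_affinity = -12 + 8 * rng.rand(100)   # affinities in kcal/mol

    # classifier: trained on labels, predicts a probability per compound
    clf = RandomForestClassifier(n_estimators=100).fit(X, y_class)
    proba = clf.predict_proba(X)[:, 1]     # P(active)
    labels = (proba > 0.5).astype(int)     # thresholded at 0.5

    # regressor: trained on scalar affinities, predicts scalar affinities
    reg = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, y_affinity)
    affinities = reg.predict(X)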





> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> maciek at wojcikowski.pl
>
> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <tevang3 at gmail.com>:
>
>> After some thought about this problem today, I think it is an objective
>> function minimization problem, where the objective function can be the root
>> mean square deviation (RMSD) between the affinities of the common molecules
>> in the two data sets. I could work iteratively: first rescale and fit assay
>> B to match A, then proceed to assay C, and so forth. Alternatively, for
>> each assay I need to find two unknowns, the optimal shift Sh and the scale
>> Sc. So if I have 3 assays A, B and C, let's say, I am looking for the
>> optimal values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD
>> between the binding affinities of the overlapping molecules. Any idea how I
>> can do that with scikit-learn?
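
In symbols, the objective described above is (a reconstruction, with y_{i,m}
the measured affinity of molecule m in assay i and M_{ij} the set of
molecules shared by assays i and j):

    \min_{\{Sh_i,\, Sc_i\}} \;
    \sqrt{\frac{1}{N} \sum_{i<j} \sum_{m \in M_{ij}}
          \bigl( (Sc_i\, y_{i,m} + Sh_i) - (Sc_j\, y_{j,m} + Sh_j) \bigr)^2 }

where N is the total number of shared measurements counted over all assay
pairs. Note that one assay (e.g. A) has to be pinned as the reference
(Sc_A = 1, Sh_A = 0), otherwise letting all scales collapse to zero
minimizes the objective trivially.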
>>
>>
>> On 6 September 2017 at 00:29, Thomas Evangelidis <tevang3 at gmail.com>
>> wrote:
>>
>>> Thanks Jason, Sebastian and Maciek!
>>>
>>> I believe, from all the suggestions, the most feasible solution is to
>>> look for experimental assays that overlap by at least two compounds, and
>>> then adjust the binding affinities of one of them by looking at their
>>> difference in both assays. Sebastian mentioned the simplest scenario,
>>> where the shift for both compounds is 2 kcal/mol. However, he neglected to
>>> mention that the ratio between the affinities of the two compounds in each
>>> assay also matters. Specifically, the ratio Ka/Kb = -7/-10 = 0.70 in assay
>>> A but -9/-12 = 0.75 in assay B. Ideally that should also be taken into
>>> account to select the right transformation function for the values from
>>> assay B. Is anybody aware of a clever algorithm to select the right
>>> transformation function for such a case? I am sure one exists.
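
For what it's worth, two shared compounds already pin down a linear
transform exactly; a minimal sketch with the numbers from Sebastian's
example (numpy.polyfit is just one way to fit it):

    import numpy as np

    # the two shared compounds, using the numbers from Sebastian's example
    assay_A = np.array([-7.0, -10.0])   # affinities measured in assay A
    assay_B = np.array([-9.0, -12.0])   # the same compounds in assay B

    # a first-degree polynomial through two points is exact
    scale, shift = np.polyfit(assay_B, assay_A, deg=1)
    # here scale = 1.0 and shift = 2.0, i.e. adding 2 kcal/mol maps B onto A;
    # with more than two shared compounds this becomes a least-squares fit

Note that a pure shift preserves affinity differences but not ratios, which
is exactly the concern raised above.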
>>>
>>> The other approach would be to train a different predictor on each
>>> assay and then apply a data fusion technique (e.g. minimum rank). But that
>>> wouldn't be as elegant.
>>>
>>> @Maciek To my understanding, the paper you cited addresses a
>>> classification problem (actives, inactives) by implementing Random Forest
>>> Classifiers. My case is a regression problem.
>>>
>>>
>>> best,
>>> Thomas
>>>
>>>
>>> On 5 September 2017 at 20:33, Maciek Wójcikowski <maciek at wojcikowski.pl>
>>> wrote:
>>>
>>>> Hi Thomas and others,
>>>>
>>>> It also really depends on how many data points you have for each
>>>> compound. If you have more than a few, then there are a few options. If
>>>> you get two very distinct activities for one ligand, I'd discard such
>>>> samples as ambiguous, or decide on one of the assays/experiments (the one
>>>> with the lower error). The exact problem was faced by the PDBbind
>>>> creators; I'd also look there for details on what they did with their
>>>> activities.
>>>>
>>>> To follow up on Sebastian's suggestion: have you checked how different
>>>> the ranks/Z-scores you get are? Check out the Kendall tau.
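
That check is a one-liner with scipy.stats.kendalltau (the affinity lists
below are hypothetical):

    from scipy.stats import kendalltau

    # hypothetical affinities of the shared ligands in two assays
    affinities_A = [-7.0, -10.0, -8.2, -9.5]
    affinities_B = [-9.0, -12.0, -10.1, -11.9]

    tau, p_value = kendalltau(affinities_A, affinities_B)
    # tau close to 1 means the two assays rank the ligands consistently,
    # even if their absolute scales disagree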
>>>>
>>>> Anyhow, you could build local models for specific experimental methods.
>>>> In our recent publication in a slightly different area (protein-ligand
>>>> scoring functions), we show that an RF built on one target is only
>>>> slightly better than an RF built on many targets (we used the DUD-E
>>>> database); check out the "horizontal" and "per-target" splits:
>>>> https://www.nature.com/articles/srep46710. Unfortunately, this may change
>>>> for different models, and it also depends on the molecular descriptors
>>>> used, which we know nothing about here.
>>>>
>>>> I hope that helped a bit.
>>>>
>>>>
>>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.raschka at gmail.com>:
>>>>
>>>>> Another approach would be to pose this as a "ranking" problem and
>>>>> predict relative affinities rather than absolute affinities. E.g., if
>>>>> you have data from one (or more) molecules that has/have been tested
>>>>> under 2 or more experimental conditions, you can rank the other
>>>>> molecules accordingly or normalize. E.g., if you observe that the
>>>>> binding affinity of molecule a is -7 kcal/mol in assay A and -9 kcal/mol
>>>>> in assay B, and say the binding affinities of molecule b are -10 and -12
>>>>> kcal/mol, respectively, that should give you some information for
>>>>> normalizing the values from assay B (e.g., by adding 2 kcal/mol). Of
>>>>> course this is not a perfect solution and might be error prone, but so
>>>>> are experimental assays ... (when I sometimes look at the std error/CI
>>>>> of the data I get from collaborators ... well, it seems that absolute
>>>>> binding affinities should always be taken with a grain of salt anyway).
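
Both variants of that idea in a few lines (a sketch with hypothetical
numbers; the first two entries play the role of the shared molecules a
and b):

    import numpy as np
    from scipy.stats import rankdata

    assay_A = np.array([-7.0, -10.0, -8.4])   # affinities from assay A
    assay_B = np.array([-9.0, -12.0, -11.1])  # affinities from assay B

    # within-assay ranks are comparable even when absolute scales are not
    ranks_A = rankdata(assay_A)
    ranks_B = rankdata(assay_B)

    # or estimate a constant offset from the shared molecules a and b
    shift = np.mean(assay_A[:2] - assay_B[:2])   # = 2.0 kcal/mol here
    assay_B_normalized = assay_B + shift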
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcrudy at gmail.com> wrote:
>>>>> >
>>>>> > Thomas,
>>>>> >
>>>>> > This is sort of related to the problem I did my M.S. thesis on years
>>>>> > ago: cross-platform normalization of gene expression data.  If you
>>>>> > google that term you'll find some papers.  The situation is somewhat
>>>>> > different, though, because with microarrays or RNA-seq you get
>>>>> > thousands of data points for each experiment, which makes it easier to
>>>>> > estimate the batch effect.  The principle is similar, however.
>>>>> >
>>>>> > If I were in your situation, I would consider whether I have any of
>>>>> > the following advantages:
>>>>> >
>>>>> > 1. Some molecules that appear in multiple data sets
>>>>> > 2. Detailed information about the different experimental conditions
>>>>> > 3. Physical/chemical models of how experimental conditions influence
>>>>> > binding affinity
>>>>> >
>>>>> > If you have any of the above, you can potentially use them to improve
>>>>> > your estimates.  You could also consider using the experiment ID as a
>>>>> > categorical predictor in a sufficiently general regression method.
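
One way that categorical-predictor suggestion could look (a sketch; the
column names, data, and choice of GradientBoostingRegressor are illustrative
assumptions, not a recommendation):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # hypothetical pooled table: descriptors, the assay each measurement
    # came from, and the measured affinity
    df = pd.DataFrame({
        "desc1":    np.random.rand(8),
        "desc2":    np.random.rand(8),
        "assay":    ["A", "A", "A", "B", "B", "C", "C", "C"],
        "affinity": [-7.0, -9.0, -8.0, -10.0, -12.0, -6.0, -8.0, -7.0],
    })

    # one-hot encode the experiment ID so the model can learn per-assay offsets
    X = pd.get_dummies(df[["desc1", "desc2", "assay"]], columns=["assay"])
    y = df["affinity"]

    model = GradientBoostingRegressor().fit(X, y)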
>>>>> >
>>>>> > Lastly, you may already know this, but the term "meta-analysis" is
>>>>> > relevant here, and you can google for specific techniques.  Most of
>>>>> > these would be more limited than what you are envisioning, I think.
>>>>> >
>>>>> > Best,
>>>>> >
>>>>> > Jason
>>>>> >
>>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis
>>>>> > <tevang3 at gmail.com> wrote:
>>>>> > Greetings,
>>>>> >
>>>>> > I am working on a problem that involves predicting the binding
>>>>> > affinity of small molecules to a receptor structure (a regression
>>>>> > problem, not classification). I have multiple small datasets of
>>>>> > molecules with measured binding affinities for a receptor, but each
>>>>> > dataset was measured under different experimental conditions and
>>>>> > therefore I cannot use them all together as a training set. So,
>>>>> > instead of using them individually, I was wondering whether there is a
>>>>> > method to combine them all into one super training set. The first way
>>>>> > I could think of is to convert the binding affinities to Z-scores and
>>>>> > then combine all the small datasets of molecules. But this would be
>>>>> > inaccurate because, firstly, the datasets are very small (10-50
>>>>> > molecules each), and secondly, the range of binding affinities differs
>>>>> > in each experiment (some datasets contain really strong binders, while
>>>>> > others do not, etc.). Is there any other approach to combine datasets
>>>>> > with values coming from different sources? Maybe if someone points me
>>>>> > to the right reference I could read and understand whether it is
>>>>> > applicable to my case.
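
For reference, the per-assay Z-score pooling described there is nearly a
one-liner with scipy.stats.zscore (the arrays are hypothetical):

    import numpy as np
    from scipy.stats import zscore

    # three small, mutually incompatible affinity datasets
    assays = [np.array([-7.0, -9.0, -8.1]),
              np.array([-10.0, -12.0, -11.3, -10.6]),
              np.array([-5.5, -6.8, -6.0])]

    # standardize each assay separately, then pool into one training vector
    y_combined = np.concatenate([zscore(a) for a in assays])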
>>>>> >
>>>>> > Thanks,
>>>>> > Thomas
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
>


-- 

======================================================================

Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049, 62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr, tevang3 at gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/