[scikit-learn] combining datasets from different sources

Maciek Wójcikowski maciek at wojcikowski.pl
Tue Sep 5 14:33:39 EDT 2017

Hi Thomas and others,

It also really depends on how many data points you have for each compound. If
you have more than a few, then there are a few options. If you get two very
distinct activities for one ligand, I'd discard such samples as ambiguous
or decide on one of the assays/experiments (the one with the lower error). The
same problem was faced by the PDBbind creators; I'd also look there for
details on what they did with their activities.
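
A minimal sketch of the two options above (drop ambiguous ligands, or keep the
measurement with the lower error), assuming pandas. The column names, the toy
values and the 1 kcal/mol tolerance are all made up for illustration; this is
not what PDBbind does.

```python
import pandas as pd

# Hypothetical measurements: one row per (ligand, assay) pair, in kcal/mol.
data = pd.DataFrame({
    "ligand":   ["L1", "L1", "L2", "L2", "L3"],
    "assay":    ["A",  "B",  "A",  "B",  "A"],
    "affinity": [-7.0, -7.4, -6.0, -10.5, -8.2],
    "error":    [0.3,  0.5,  0.4,  0.2,   0.3],
})

MAX_SPREAD = 1.0  # assumed tolerance (kcal/mol); tune it for your assays


def resolve_duplicates(df, max_spread=MAX_SPREAD, keep_ambiguous=False):
    """Return one row per ligand.

    Ligands whose measurements differ by more than `max_spread` are dropped
    unless `keep_ambiguous` is True. Among the remaining rows, the measurement
    with the lowest reported error is kept for each ligand.
    """
    # Range (max - min) of the affinities measured for each ligand
    spread = df.groupby("ligand")["affinity"].transform(lambda s: s.max() - s.min())
    if not keep_ambiguous:
        df = df[spread <= max_spread]
    # Index label of the lowest-error row within each ligand group
    best_rows = df.groupby("ligand")["error"].idxmin()
    return df.loc[best_rows].reset_index(drop=True)


strict = resolve_duplicates(data)                        # L2 is discarded
lenient = resolve_duplicates(data, keep_ambiguous=True)  # L2 keeps assay B
print(strict)
print(lenient)
```

Which option is better depends on how much you trust the reported errors.
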

To follow up on Sebastian's suggestion: have you checked how different the
ranks/Z-scores you get are? Check out Kendall's tau.
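
For example, if some compounds were measured in two assays, the rank agreement
can be checked with `scipy.stats.kendalltau`. The affinities below are invented
for illustration only.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical affinities (kcal/mol) of the same five compounds in two assays
assay_a = np.array([-7.0, -10.0, -8.5, -6.2, -9.1])
assay_b = np.array([-9.0, -12.0, -10.1, -8.9, -8.0])

# tau close to 1 means both assays rank the compounds the same way,
# tau close to 0 means the rankings are essentially unrelated
tau, p_value = kendalltau(assay_a, assay_b)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```

Note that Z-scoring within a single assay is a monotonic transformation, so it
does not change the ranks inside that assay; the comparison is only
informative on compounds shared between assays.
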

Anyhow, you could build local models for specific experimental methods.
In our recent publication in a slightly different area (protein-ligand
scoring functions), we show that an RF built on one target is just slightly
better than an RF built on many targets (we used the DUD-E database);
check out the "horizontal" and "per-target" splits:
https://www.nature.com/articles/srep46710. Unfortunately, this may change
for different models. It also depends on the molecular descriptors used,
which we know nothing about.
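
A rough way to compare local and pooled models on your own data could look
like the sketch below. It assumes scikit-learn and uses purely synthetic
descriptors, affinities and per-experiment shifts; it is not the protocol from
the paper, just a plain cross-validation comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 experiments, 40 molecules each, 5 made-up descriptors
n_exp, n_mol, n_feat = 3, 40, 5
X = rng.normal(size=(n_exp * n_mol, n_feat))
experiment = np.repeat(np.arange(n_exp), n_mol)
offsets = np.array([0.0, -2.0, 1.5])  # assumed shift of each experiment
y = (X[:, 0] - 0.5 * X[:, 1] + offsets[experiment]
     + rng.normal(scale=0.3, size=len(X)))


def cv_r2(X, y):
    """Mean 5-fold cross-validated R^2 of a random forest."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()


# One "local" model per experiment versus one model on the pooled data
local_scores = {e: cv_r2(X[experiment == e], y[experiment == e])
                for e in range(n_exp)}
pooled_score = cv_r2(X, y)

print("local R^2 per experiment:", local_scores)
print("pooled R^2:", pooled_score)
```

With only 10-50 molecules per dataset the local scores will be noisy, so treat
any difference with caution.
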

I hope that helped a bit.

Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
maciek at wojcikowski.pl

2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.raschka at gmail.com>:

> Another approach would be to pose this as a "ranking" problem to predict
> relative affinities rather than absolute affinities. E.g., if you have data
> from one (or more) molecules that has/have been tested under 2 or more
> experimental conditions, you can rank the other molecules accordingly or
> normalize. E.g., if you observe that the binding affinity of molecule A is
> -7 kcal/mol in assay 1 and -9 kcal/mol in assay 2, and say the binding
> affinities of molecule B are -10 and -12 kcal/mol, respectively, that
> should give you some information for normalizing the values from assay 2
> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and
> might be error prone, but so are experimental assays ... (when I sometimes
> look at the std error/CI of the data I get from collaborators ... well, it
> seems that absolute binding affinities always have to be taken with a grain
> of salt anyway)
> Best,
> Sebastian
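
The offset idea above can be sketched with pandas as follows. Molecules A and
B use the numbers from the example in the quoted message; molecules C and D
are hypothetical extras added to show what happens to non-shared compounds.

```python
import pandas as pd

# Affinities in kcal/mol, indexed by molecule name
assay1 = pd.Series({"A": -7.0, "B": -10.0, "C": -8.0})
assay2 = pd.Series({"A": -9.0, "B": -12.0, "D": -11.0})

# Molecules measured in both assays serve as reference points
shared = assay1.index.intersection(assay2.index)

# Mean difference on the shared molecules estimates the assay offset
offset = (assay1[shared] - assay2[shared]).mean()

# Shift assay 2 onto the scale of assay 1
assay2_on_assay1_scale = assay2 + offset
print(offset)
print(assay2_on_assay1_scale)
```

A constant shift is the simplest possible correction; with more shared
molecules one could also check whether the difference depends on the affinity
itself.
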
> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcrudy at gmail.com> wrote:
> >
> > Thomas,
> >
> > This is sort of related to the problem I did my M.S. thesis on years
> > ago: cross-platform normalization of gene expression data.  If you google
> > that term you'll find some papers.  The situation is somewhat different,
> > though, because with microarrays or RNA-seq you get thousands of data
> > points for each experiment, which makes it easier to estimate the batch
> > effect.  The principle is similar, however.
> >
> > If I were in your situation, I would consider whether I have any of the
> > following advantages:
> >
> > 1. Some molecules that appear in multiple data sets
> > 2. Detailed information about the different experimental conditions
> > 3. Physical/chemical models of how experimental conditions influence
> >    binding affinity
> >
> > If you have any of the above, you can potentially use them to improve
> > your estimates.  You could also consider using experiment ID as a
> > categorical predictor in a sufficiently general regression method.
> >
> > Lastly, you may already know this, but the term "meta-analysis" is
> > relevant here, and you can google for specific techniques.  Most of these
> > would be more limited than what you are envisioning, I think.
> >
> > Best,
> >
> > Jason
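
Using the experiment ID as a categorical predictor could look roughly like
this in scikit-learn. The descriptor columns, experiment names and shifts are
synthetic assumptions, and a random forest is just one possible choice of a
"sufficiently general" regressor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 120

# Synthetic data: two made-up descriptors plus the experiment each row came from
df = pd.DataFrame({
    "desc_1": rng.normal(size=n),
    "desc_2": rng.normal(size=n),
    "experiment_id": rng.choice(["exp_a", "exp_b", "exp_c"], size=n),
})
# Assumed systematic shift of each experiment (kcal/mol)
shift = df["experiment_id"].map({"exp_a": 0.0, "exp_b": -2.0, "exp_c": 1.5})
y = df["desc_1"] - 0.5 * df["desc_2"] + shift + rng.normal(scale=0.3, size=n)

# One-hot encode the experiment ID, pass the descriptors through unchanged
preprocess = ColumnTransformer(
    [("experiment", OneHotEncoder(handle_unknown="ignore"), ["experiment_id"])],
    remainder="passthrough",
)
model = make_pipeline(
    preprocess, RandomForestRegressor(n_estimators=200, random_state=0)
)
model.fit(df, y)

# Predict the same hypothetical molecule under each experimental condition
query = pd.DataFrame({
    "desc_1": [0.0, 0.0, 0.0],
    "desc_2": [0.0, 0.0, 0.0],
    "experiment_id": ["exp_a", "exp_b", "exp_c"],
})
pred = model.predict(query)
print(dict(zip(query["experiment_id"], pred)))
```

At prediction time you then have to choose which experiment's scale you want
the answer on.
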
> >
> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> > Greetings,
> >
> > I am working on a problem that involves predicting the binding affinity
> > of small molecules to a receptor structure (it is a regression problem,
> > not classification). I have multiple small datasets of molecules with
> > measured binding affinities for a receptor, but each dataset was measured
> > under different experimental conditions and therefore I cannot use them
> > all together as a training set. So, instead of using them individually, I
> > was wondering whether there is a method to combine them all into a super
> > training set. The first way I could think of is to convert the binding
> > affinities to Z-scores and then combine all the small datasets of
> > molecules. But this would be inaccurate because, firstly, the datasets
> > are very small (10-50 molecules each), and secondly, the range of binding
> > affinities differs in each experiment (some datasets contain really
> > strong binders, while others do not, etc.). Is there any other approach
> > to combine datasets with values coming from different sources? Maybe if
> > someone points me to the right reference I could read it and understand
> > whether it is applicable to my case.
> >
> > Thanks,
> > Thomas
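
For reference, the per-dataset Z-score conversion described above is a
one-liner with a pandas groupby. The two toy datasets below are invented to
show the second caveat from the quoted message (different affinity ranges).

```python
import pandas as pd

# Hypothetical affinities (kcal/mol): d1 holds weak binders, d2 strong ones
df = pd.DataFrame({
    "dataset":  ["d1", "d1", "d1", "d2", "d2", "d2"],
    "molecule": ["m1", "m2", "m3", "m4", "m5", "m6"],
    "affinity": [-6.0, -7.0, -8.0, -10.0, -11.0, -12.0],
})

# Standardize within each dataset (sample standard deviation, ddof=1)
grouped = df.groupby("dataset")["affinity"]
df["z"] = (df["affinity"] - grouped.transform("mean")) / grouped.transform("std")
print(df)
```

Here m1 (-6 kcal/mol) and m4 (-10 kcal/mol) end up with the same Z-score,
which is exactly the information loss described above when the datasets cover
different affinity ranges.
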
> >
> > --
> > ======================================================================
> > Dr Thomas Evangelidis
> > Post-doctoral Researcher
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/2S049,
> > 62500 Brno, Czech Republic
> >
> > email: tevang at pharm.uoa.gr
> >               tevang3 at gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn