[scikit-learn] anti-correlated predictions by SVR

Sebastian Raschka se.raschka at gmail.com
Tue Sep 26 12:58:00 EDT 2017


I'd agree with Gael that a potential explanation could be the distribution shift upon splitting (usually, the smaller the dataset, the more of an issue this is). As potential solutions/workarounds, you could try

a) stratified sampling for regression, if you'd like to stick with the 2-way holdout method
b) use leave-one-out cross-validation for evaluation (your model will likely benefit from the additional training samples)
c) use leave-one-out bootstrap (at each round, draw a bootstrap sample from the dataset for training, then use the points not in the training dataset for testing)
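
In case it helps, here is a rough sketch of how (a)-(c) could look. Everything specific in it is an assumption on my side: the data is synthetic (make_regression, 41 observations to mimic your 16 + 25), and the linear kernel, C=1.0, the 4 quantile bins, and the 200 bootstrap rounds are arbitrary placeholders rather than recommendations.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr
from sklearn.datasets import make_regression
from sklearn.model_selection import LeaveOneOut, cross_val_predict, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption: 41 observations, 5 features).
X, y = make_regression(n_samples=41, n_features=5, noise=10.0, random_state=0)


def make_model():
    # Placeholder hyperparameters, not tuned for any real dataset.
    return SVR(kernel="linear", C=1.0)


# (a) "Stratified" holdout for regression: bin the target into quartiles and
# stratify the split on the bin labels so both sets cover the target range.
bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
X_train, X_test, y_train, y_test, bins_train, bins_test = train_test_split(
    X, y, bins, test_size=25, stratify=bins, random_state=0
)
holdout_pred = make_model().fit(X_train, y_train).predict(X_test)
holdout_r, _ = pearsonr(y_test, holdout_pred)

# (b) Leave-one-out CV: each observation is predicted by a model trained on
# all the others; the correlations are computed on the pooled predictions.
loo_pred = cross_val_predict(make_model(), X, y, cv=LeaveOneOut())
loo_r, _ = pearsonr(y, loo_pred)
loo_tau, _ = kendalltau(y, loo_pred)

# (c) Leave-one-out bootstrap: train on a bootstrap sample (drawn with
# replacement), test on the points that were not drawn (out-of-bag).
rng = np.random.RandomState(0)
n = len(y)
oob_r = []
for _ in range(200):
    train_idx = rng.randint(0, n, size=n)
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    if len(test_idx) < 3:
        continue  # too few out-of-bag points for a meaningful correlation
    pred = make_model().fit(X[train_idx], y[train_idx]).predict(X[test_idx])
    r, _ = pearsonr(y[test_idx], pred)
    oob_r.append(r)

print("holdout R: %.3f" % holdout_r)
print("LOO R: %.3f, tau: %.3f" % (loo_r, loo_tau))
print("bootstrap R: mean %.3f, std %.3f" % (np.mean(oob_r), np.std(oob_r)))
```

The spread of the out-of-bag correlations in (c) gives a feeling for how unstable the estimate is at this sample size, which a single holdout split cannot show.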

Best,
Sebastian

> On Sep 26, 2017, at 12:48 PM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> 
> I have very small training sets (10-50 observations). Currently, I am working with 16 observations for training and 25 for validation (external test set). And I am doing Regression, not Classification (hence the SVR instead of SVC).
> 
> 
> On 26 September 2017 at 18:21, Gael Varoquaux <gael.varoquaux at normalesup.org> wrote:
> Hypothesis: you have a very small dataset and when you leave out data,
> you create a distribution shift between the train and the test. A
> simplified example: 20 samples, 10 class a, 10 class b. A leave-one-out
> cross-validation will create a training set of 10 samples of one class, 9
> samples of the other, and the test set is composed of the class that is
> minority on the train set.
> 
> G
> 
> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
> > Greetings,
> 
> > I don't know if anyone encountered this before, but sometimes I get
> > anti-correlated predictions by the SVR that I am training. Namely, the
> > Pearson's R and Kendall's tau are negative when I compare the predictions on
> > the external test set with the true values. However, the SVR predictions on the
> > training set have positive correlations with the experimental values and hence
> > I can't think of a way to know in advance if the trained SVR will produce
> > anti-correlated predictions in order to change their sign and avoid the
> > disaster. Here is an example of what I mean:
> 
> > Training set predictions: R=0.452422, tau=0.333333
> > External test set predictions: R=-0.537420, tau=-0.300000
> 
> > Obviously, in a real case scenario where I wouldn't have the external test set
> > I would have used the worst observations instead of the best ones. Has anybody
> > any idea about how I could prevent this?
> 
> > thanks in advance
> > Thomas
> --
>     Gael Varoquaux
>     Researcher, INRIA Parietal
>     NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>     Phone:  ++ 33-1-69-08-79-68
>     http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> -- 
> ======================================================================
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tevang at pharm.uoa.gr
>          	tevang3 at gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 