[scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)

Andreas Mueller t3kcit at gmail.com
Mon Feb 17 18:34:00 EST 2020



On 2/14/20 5:47 PM, Paul Chike Ofoche via scikit-learn wrote:
>
> Many thanks Nicolas and Andreas.
>
>
>
> I was wondering whether this multioutput handling capability of the 
> RandomForestRegressor was added recently. To verify, I went on a 
> fact-finding mission by re-running the exact same code I had written 
> in 2018 and noticed quite a number of changes. I guess that many 
> moons have passed since then!
>
> For instance, sklearn.cross_validation has been deprecated (and 
> replaced by sklearn.model_selection) since I last used it in 2018. 
> Also, errors such as:
>
> i. ValueError: Expected 2D array, got scalar array instead:
>
> array=6.5.
>
> Reshape your data either using array.reshape(-1, 1) if your data has a 
> single feature or array.reshape(1, -1) if it contains a single sample.
>
> and
>
> ii. DataConversionWarning: A column-vector y was passed when a 1d 
> array was expected. Please change the shape of y to (n_samples,), for 
> example using ravel().
>
All of these were errors in 2018 already; you might not have had the 
most up-to-date version then ;)
cross_validation was deprecated in 2016:
https://scikit-learn.org/dev/whats_new/v0.18.html#version-0-18
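
For reference, the usual fixes look roughly like this (a minimal 
sketch on made-up toy data, not your original code):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    # sklearn.cross_validation is gone; the same helpers now live
    # in sklearn.model_selection:
    from sklearn.model_selection import train_test_split

    X = np.arange(20.0).reshape(-1, 1)   # 20 samples, 1 feature
    y = np.arange(20.0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    rf = RandomForestRegressor(n_estimators=10, random_state=0)
    rf.fit(X_train, y_train)

    # i. predict() expects a 2D array, so reshape a scalar into a
    #    single sample with a single feature:
    rf.predict(np.array(6.5).reshape(1, -1))

    # ii. for a single target, fit() expects y of shape
    #     (n_samples,); flatten a column vector with ravel():
    y_col = y_train.reshape(-1, 1)  # (n, 1) -> DataConversionWarning
    rf.fit(X_train, y_col.ravel())  # (n,)   -> no warning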

> These errors, raised when passing a *scalar* and a *column-vector y* 
> respectively, are entirely new since I last made use of Python's 
> RandomForestRegressor. Previously, such calls worked just fine 
> without throwing any errors. I know that multioutputs were handled 
> back in 2018 (I actually tested this capability back then), but I 
> assumed that the regressors were fit per target, i.e. that no 
> correlation between targets was exploited.
>
I can't find a changelog entry, but I'm pretty sure this goes back to 
2014 or so. It was definitely present in 2018.
>
> Today, for comparison, I generated some random target outputs (three 
> columns) and, using the same *random_state*, ran the all-inclusive 
> multioutput prediction (all three output targets simultaneously) vs. 
> re-running each output prediction one at a time. The results are 
> different, implying that some form of correlation takes place 
> amongst the multioutput targets when they are predicted together. 
> (For completeness, I display the first 28 predicted output values 
> from the multioutput prediction as well as from the single-output 
> predictions.)
>
> For my own knowledge, could you please tell me what technique is now 
> employed to take advantage of the correlations between targets? Is 
> it the Mahalanobis distance or some other metric? In other words, 
> could you please give me a hint as to the underlying reason why the 
> single-output predictions differ from the multioutput predictions? I 
> am curious, as this would finally put the question to rest after 
> nearly two years. I will have to retrace my steps and get back to 
> the good old Python ways (again). Thank you.
>
It doesn't explicitly use the correlation. The splitting criterion is 
the sum of the splitting criteria over the individual outputs, so 
every split has to be good on average across all targets. That means 
there's an implicit regularization, as the tree structure is shared 
between the targets.
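
Roughly, the comparison you describe looks like this (a minimal sketch 
on random toy data; the shapes and hyperparameters are illustrative, 
not taken from your experiment):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 4)
    Y = rng.rand(200, 3)   # three random target columns

    # One forest fit on all three targets at once: every split
    # minimizes the sum of the per-output impurities, and the
    # resulting tree structure is shared between the targets.
    multi = RandomForestRegressor(n_estimators=100, random_state=42)
    multi.fit(X, Y)

    # One forest per target: each tree structure is free to
    # specialize on its own output.
    singles = [
        RandomForestRegressor(n_estimators=100,
                              random_state=42).fit(X, Y[:, j])
        for j in range(Y.shape[1])
    ]

    pred_multi = multi.predict(X[:5])                    # (5, 3)
    pred_single = np.column_stack([m.predict(X[:5])
                                   for m in singles])

    # The two disagree in general, even with the same random_state:
    print(np.allclose(pred_multi, pred_single))          # False

No Mahalanobis distance (or any other explicit correlation measure) is 
involved; the coupling between the outputs comes entirely from the 
shared tree structure.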