Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
Hello all,

My name is Paul and I am enthusiastic about data science. I have been using Python and other programming languages for close to two years. There is an issue that I have been facing since I began applying Python to the analysis of my research work.

My question has remained unanswered for months. Has anybody else run into the need to work with data where the regression result has multiple outputs and the output parameters are correlated with each other? This is called a multi-output multivariate problem. A version of random forest that handles multiple outputs is referred to as the multivariate random forest. It is implemented in the programming language R (see attached reference documentation below).

To date, no such package exists in Python. My question is whether anybody knows how to go about implementing this. The univariate random forest regression case uses the Euclidean distance as the measurement criterion, whereas the multivariate regression case uses the Mahalanobis distance, which takes into account the inter-relationships between the multiple outputs. I have inquired about an equivalent capability in Python for many years, but it has still not been addressed. Such a multivariate random forest mode is very applicable to the type of research and analysis that I do. Could someone help, please?

Thank you,
Paul Ofoche

PS: This is an important need for multivariate output analysis as a technique for solving practical research problems. Here are some questions posted by other Python users concerning this same issue.

https://datascience.stackexchange.com/questions/21637/code-for-multivariate-...
Multi-output regression <https://stackoverflow.com/questions/49391637/multi-output-regression>
Speaking as an ignorant lurker/non-user of sklearn, the way I see this being handled in neural nets is

https://keras.io/examples/cifar10_cnn/

    model.add(Dense(num_classes))
    model.add(Activation('softmax'))

Not sure if that will map to sklearn.

Bill
On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote:
My question has remained unanswered for months. Has anybody not run into the need to work with data whereby the regression results are a multiple output, in which the output parameters are correlated with each other? This is called a multi-output multivariate problem. A version of random forest that handles multiple outputs is referred to as the multivariate random forest. It is implemented in the programming language, R (see attached reference documentation below).
The scikit-learn random forest actually handles this. It doesn't use the mahalanobis distance but that seems like a simple preprocessing step.
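One way to read the "preprocessing step" suggestion is to whiten the targets before fitting, since Euclidean distance between whitened targets equals Mahalanobis distance between the original ones. The sketch below assumes a single covariance matrix estimated from all training targets; the synthetic data, the variable names, and n_estimators=50 are illustrative assumptions, and this is not claimed to reproduce what the R package does.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Two correlated targets (synthetic, for illustration only).
noise = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 2]]) + noise

# Whitening: with cov = L @ L.T, the map y -> L^{-1} (y - mean) turns
# Mahalanobis distance (under cov) into plain Euclidean distance.
mean = Y.mean(axis=0)
cov = np.cov(Y, rowvar=False)
L = np.linalg.cholesky(cov)
Y_white = np.linalg.solve(L, (Y - mean).T).T

# The forest's squared-error criterion now acts on the whitened targets.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, Y_white)

# Map predictions back to the original target space.
Y_pred = forest.predict(X) @ L.T + mean
```

Because the whitening is an affine map, the node means transform consistently, so the summed squared error in the whitened space corresponds to summed squared Mahalanobis distances to the node mean in the original space (under the one global covariance assumed here).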
Scikit-learn random forest does not handle the multi-output case, but only maps to each output one at a time, thereby not accounting for the correlation between multi-outputs, which is what the Mahalanobis distance does. I, as well as other researchers, have observed this issue for as long as two years. Could there be a solution to implement it in RandomForest, since Python already has a function that computes Mahalanobis distances?

On Thursday, February 13, 2020, 10:15:11 PM CST, Andreas Mueller <t3kcit@gmail.com> wrote:

The scikit-learn random forest actually handles this. It doesn't use the mahalanobis distance but that seems like a simple preprocessing step.
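For readers unfamiliar with the distance being discussed, here is a small sketch using SciPy's `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix as its third argument. The sample data and the two test points are made up for illustration; the point is only that two displacements of equal Euclidean length can have very different Mahalanobis lengths when the outputs are correlated.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

rng = np.random.default_rng(0)
# Synthetic pair of strongly correlated outputs (illustrative only).
Y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)
VI = np.linalg.inv(np.cov(Y, rowvar=False))  # inverse covariance matrix

origin = np.zeros(2)
along = np.array([1.0, 1.0])     # displacement that follows the correlation
against = np.array([1.0, -1.0])  # displacement that goes against it

d_e_along = euclidean(along, origin)
d_e_against = euclidean(against, origin)
d_m_along = mahalanobis(along, origin, VI)
d_m_against = mahalanobis(against, origin, VI)

print(f"Euclidean:   along={d_e_along:.3f}  against={d_e_against:.3f}")
print(f"Mahalanobis: along={d_m_along:.3f}  against={d_m_against:.3f}")
```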
Hi Paul,

The way multioutput is handled in decision trees (and thus in the forests) is described in https://scikit-learn.org/stable/modules/tree.html#multi-output-problems. As you can see, the correlation between the output values *is* taken into account.

Can you explain what you would like to modify there?

Nicolas

On 2/14/20 7:37 AM, Paul Chike Ofoche via scikit-learn wrote:

Scikit-learn random forest does *not* handle the multi-output case, but only maps to each output one at a time, thereby not accounting for the correlation between multi-outputs, which is what the Mahalanobis distance does.
Many thanks Nicolas and Andreas.

I appreciate your taking the time and effort to look into the issue that I raised and for pointing me to the documentation. It is quite pleasant to know that scikit-learn’s RandomForestRegressor handles multioutput cases. This issue has been very important to me and was the sole reason that I switched from Python to R for my research in the Fall of 2018 and have seldom used Python since then.

I got convinced about my earlier stance when reading a documentation such as https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regressi... which explained that the “MultiOutputRegressor fits one regressor per target and cannot take advantage of correlations between targets”, although I am aware that this is different from the RandomForestRegressor.

[Inline image]

I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. In order to verify, I went on a fact-finding mission by re-running the exact same codes I had in 2018 and noticed quite a number of changes. I guess that many moons have passed since then!

For instance, sklearn.cross_validation has been deprecated since when last I used it in 2018 (and replaced by sklearn.model_selection). Also, such errors as:

i. ValueError: Expected 2D array, got scalar array instead: array=6.5. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

and

ii. DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

when passing a scalar and a column-vector y respectively are entirely new from when last I made use of Python’s RandomForestRegressor. Previously, they worked just fine without throwing out any errors. I know that the “multioutputs” were handled back in 2018 (I actually tested this capability back then), but I assumed that the regressors were fit per target i.e. that there was no correlation between targets.

Today, for comparison, I generated some random target outputs (three columns) and using the same random_state, I ran the all-inclusive multioutput prediction (with all three output targets simultaneously vs. re-running each output prediction one at a time). The results are different, implying that some form of correlation takes place amongst the multioutput targets, when predicted together. (For completeness, I display the first 28 predicted output values, from the multioutput prediction as well as the single output predictions.)

Results from the multioutput prediction of the targets (capturing their correlations).

[Inline image]

Results from the individual prediction of each single output target.

[Inline image]

For my knowledge’s sake, could you please inform me about the technique being employed now to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single output predictions differ from the multioutput predictions? I am curious to know as this would finally fully quench my appetite after nearly two years. I will have to retrace my steps and get back to the good old Python ways (again). Thank you.

Highest regards,
Paul

On Friday, February 14, 2020, 07:00:35 a.m. CST, Nicolas Hug <niourf@gmail.com> wrote:

Hi Paul,

The way multioutput is handled in decision trees (and thus in the forests) is described in https://scikit-learn.org/stable/modules/tree.html#multi-output-problems. As you can see, the correlation between the output values *is* taken into account.

Can you explain what you would like to modify there?

Nicolas
On 2/14/20 8:47 PM, Paul Chike Ofoche via scikit-learn wrote:

For my knowledge’s sake, could you please inform me about the technique being employed now to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single output predictions differ from the multioutput predictions?

I don't know much more than what's already in the doc that I linked to. Namely, the best split is chosen to minimize the *average* criterion across all outputs, instead of just using a single output. You'll find more details in the code.

About the docs: we generally try to write all the useful info about the estimators in the "User Guide" section (https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-...). In this case you can find a link to the multi-output handling. Sometimes the info is instead in the docstrings. That's not always perfect though, and the link might not have been there when you first looked. We're working hard to keep on improving the docs. But there's so much info that it's easy to miss some...

Welcome back to Python!
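The rule described in the reply above (pick the split that minimizes the criterion averaged over all outputs) can be sketched in a few lines of NumPy. This is only an illustration of the idea for a single feature with a squared-error criterion; the function names are hypothetical and it is not scikit-learn's actual implementation.

```python
import numpy as np

def multioutput_mse(Y):
    """Variance around the node mean, computed per output and then averaged."""
    return float(np.mean(np.var(Y, axis=0)))

def split_cost(x, Y, threshold):
    """Size-weighted impurity of the two children for the split x <= threshold."""
    left = x <= threshold
    right = ~left
    n = len(x)
    return (left.sum() * multioutput_mse(Y[left])
            + right.sum() * multioutput_mse(Y[right])) / n

def best_threshold(x, Y):
    """Evaluate midpoints between consecutive distinct feature values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    costs = [split_cost(x, Y, t) for t in candidates]
    return candidates[int(np.argmin(costs))]

# Toy data: one feature, two outputs that both jump between x=2 and x=3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([[0.0, 10.0], [0.1, 10.2], [0.2, 9.9],
              [5.0, 20.0], [5.1, 20.1], [4.9, 19.8]])
t = best_threshold(x, Y)
print("chosen threshold:", t)
```

Every output contributes to the cost of each candidate split, so a single threshold is chosen for all targets at once rather than one threshold per target.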
On 2/14/20 5:47 PM, Paul Chike Ofoche via scikit-learn wrote:
Many thanks Nicolas and Andreas.
I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. In order to verify, I went on a fact-finding mission by re-running the exact same codes I had in 2018 and noticed quite a number of changes. I guess that many moons have passed since then!
For instance, sklearn.cross_validation has been deprecated since when last I used it in 2018 (and replaced by sklearn.model_selection). Also, such errors as:
i. ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
and
ii. DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
All of these were errors in 2018 already, you might not have had the most up-to-date version then ;) cross_validation was deprecated in 2016: https://scikit-learn.org/dev/whats_new/v0.18.html#version-0-18
when passing a *scalar* and a *column-vector y* respectively are entirely new from when last I made use of Python’s RandomForestRegressor. Previously, they worked just fine without throwing out any errors. I know that the “multioutputs” were handled back in 2018 (I actually tested this capability back then), but I assumed that the regressors were fit per target i.e. that there was no correlation between targets.
I can't find a changelog entry but pretty sure this goes back to 2014 or so. Definitely it was present in 2018.
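The two messages quoted above concern input shapes, and both have the fixes that the messages themselves suggest. A minimal sketch follows, with synthetic data and n_estimators=20 as illustrative assumptions: import from sklearn.model_selection, flatten a column-vector y with ravel(), and reshape a single scalar sample to shape (1, n_features) before calling predict.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 1))       # one feature, already 2D
y_column = 2.0 * X + rng.normal(size=(100, 1))  # column vector, shape (100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y_column, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)
# ravel() turns the (n_samples, 1) column into shape (n_samples,),
# which avoids the DataConversionWarning.
model.fit(X_train, y_train.ravel())

# A scalar sample must become a 2D array of shape (1, n_features),
# which avoids the "Expected 2D array" ValueError.
single = np.array(6.5).reshape(1, -1)
prediction = model.predict(single)
```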
Today, for comparison, I generated some random target outputs (three columns) and using the same *random_state*, I ran the all-inclusive multioutput prediction (with all three output targets simultaneously vs. re-running each output prediction one at a time). The results are different, implying that some form of correlation takes place amongst the multioutput targets, when predicted together. (For completeness, I display the first 28 predicted output values, from the multioutput prediction as well as the single output predictions.)
For my knowledge’s sake, could you please inform me about the technique being employed now to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single output predictions differ from the multioutput predictions? I am curious to know as this would finally fully quench my appetite after nearly two years. I will have to retrace my steps and get back to the good old Python ways (again). Thank you.
It doesn't explicitly use the correlation. The splitting criterion is the sum of the per-output splitting criteria. That means there's an implicit regularization, as the tree is shared between the targets.
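The comparison described in the quoted message can be reproduced along the following lines. This is a sketch under assumed settings (synthetic random data, n_estimators=30, a fixed random_state), not the original experiment: one forest is fit on all three targets jointly, and three forests are fit on one target each.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 3))     # three random target columns
X_new = rng.normal(size=(50, 5))  # unseen points to predict on

# One forest, trees shared across all three targets.
joint = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, Y)
joint_pred = joint.predict(X_new)  # shape (50, 3)

# Three independent forests, one per target, same random_state.
separate_pred = np.column_stack([
    RandomForestRegressor(n_estimators=30, random_state=0)
    .fit(X, Y[:, k])
    .predict(X_new)
    for k in range(Y.shape[1])
])

max_gap = np.abs(joint_pred - separate_pred).max()
print("largest difference between joint and per-target predictions:", max_gap)
```

The predictions differ because the joint forest picks each split using the criterion summed over all targets, so its tree structure is generally not the one any single target would have chosen on its own. No covariance or Mahalanobis term is involved.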
participants (4)
- Andreas Mueller
- Bill Ross
- Nicolas Hug
- Paul Chike Ofoche