custom loss function in RandomForestRegressor
Greetings,

The feature importance calculated by the RandomForest implementation is very useful. I personally use it to select the best features, because it is simple and fast, and then I train MLPRegressors. The limitation of this approach is that although I can control the loss function of the MLPRegressor (I have modified scikit-learn's implementation to accept an arbitrary loss function), I cannot do the same with RandomForestRegressor, and hence I have to rely on 'mse', which does not match the loss functions I use in the MLPs. Today I was looking at the _criterion.pyx file:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_crite...

However, the code is in Cython and I find it hard to follow. I know that for regression the relevant classes are Criterion(), RegressionCriterion(Criterion), and MSE(RegressionCriterion). My question is: is it possible to write a class that takes an arbitrary function "loss(predictions, targets)" to calculate the loss and impurity of the nodes?

thanks,
Thomas

--
======================================================================
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049, 62500 Brno, Czech Republic
email: tevang@pharm.uoa.gr
       tevang3@gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
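The two-stage workflow described above (forest-based feature importance feeding an MLP) can be sketched with standard scikit-learn pieces. The dataset, estimator settings, and the "median" threshold below are illustrative choices, not anything stated in the original message:

```python
# Sketch: rank features with a random forest, keep the most important ones,
# then fit an MLP regressor on the reduced feature set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the real problem.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       random_state=0)

# Keep features whose forest importance is at or above the median importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold="median",
)
model = make_pipeline(selector, MLPRegressor(max_iter=2000, random_state=0))
model.fit(X, y)
print(model.predict(X[:3]).shape)  # one prediction per sample
```

Note that the MLP's loss here is still scikit-learn's default; the point of the thread is that the forest step offers no equivalent hook.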
Yes, but if you write it in Python rather than Cython, it will be unbearably slow.

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
The ClassificationCriterion and RegressionCriterion classes are now exposed in _criterion.pxd, which allows you to create your own criterion. So you can write your own Criterion with a given loss by implementing the methods that the trees require, and then pass an instance of this criterion to the tree; it should work.
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
I wonder whether this (together with the caveat that it will be slow if written in Python) should go into the FAQ.
Sorry, I don't know Cython at all. Is _criterion.pxd like a header file in C++? I see that it contains class, function, and variable definitions, with descriptions in comments. The Criterion class is an interface and doesn't contain function definitions. By "writing your own criterion with a given loss", do you mean writing a class like MSE(RegressionCriterion)?
Is it possible to compile just the _criterion.pyx and _criterion.pxd files, using "importpyx" or some alternative, instead of compiling the whole sklearn library every time I introduce a change?

On 15. 2. 2018 at 19:29, "Guillaume Lemaitre" <g.lemaitre58@gmail.com> wrote:

Yes, you are right: the pxd files are the headers and the pyx files the definitions. You need to write a class like MSE. Criterion is an abstract/base class (I don't have it in front of me).

@Andy: if I recall the PR correctly, we made the classes public to enable such custom criteria. However, it is not documented, since we were not officially supporting it, so this is a hidden feature. We could always discuss making this feature more visible and documenting it.
Calling `python setup.py build_ext --inplace` (or `make in`) will recompile only the files that have changed, without recompiling everything. However, it is true that this can sometimes lead to errors that require cleaning and recompiling everything.
Well, maybe not go as far as giving examples, but this question has come up on the list more than 10 times.
10 times means that we could write something in the docs :)
Hi again,

I am currently revisiting this problem after familiarizing myself with Cython and scikit-learn's code, and I have a very important query:

Looking at the class MSE(RegressionCriterion), the node impurity is defined as the variance of the target values Y in that node. The predictions X are nowhere involved in the computations. This contradicts my notion of a "loss function", which quantifies the discrepancy between predicted and target values. Am I looking at the wrong class, or is what I want to do just not feasible with random forests? For example, I would like to modify the RandomForestRegressor code to maximize Pearson's R between predicted and target values.

I thank you in advance for any clarification.
Thomas
Hi, Thomas,

in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction:
https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction

Best,
Sebastian
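Sebastian's equivalence is easy to verify numerically: for a single node, the MSE of predicting the node mean is exactly the variance of the node's targets. A minimal NumPy check on synthetic data (for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)      # targets falling into one tree node

node_prediction = y.mean()    # a regression tree predicts the node mean
mse = np.mean((y - node_prediction) ** 2)

# The MSE of the constant mean prediction equals the node variance,
# so minimizing within-node variance minimizes within-node MSE.
print(np.isclose(mse, y.var()))  # True
```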
Hi Sebastian,

Going back to the Pearson's R loss function: does this imply that I must add an abstract "init2" method to RegressionCriterion (which is what the MSE class inherits from), where I will pass the predicted values X as an extra argument? And then the node impurity will be 1-R (the lower the better)? What about the impurities of the left and right splits? In the MSE class they are (sum_i^n y_i)**2, where n is the number of samples in the respective split. It is not clear how this relates to the variance, in order to adapt it for my purpose.

Best,
Thomas
Hi, Thomas,

as far as I know, it's all the same and doesn't matter: you would get the same splits, since R^2 is just a rescaled MSE.

Best,
Sebastian
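This can also be checked empirically: the total sum of squares SST is fixed for the data reaching a node, so R^2 = 1 - SSE/SST ranks candidate splits in exactly the reverse order of the weighted MSE (which is SSE/n), and the best split is the same under either score. A small NumPy sketch; the data and helper names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=50))
y = np.sin(3 * x) + rng.normal(scale=0.1, size=50)

def weighted_mse(y_left, y_right):
    # Impurity after a split: sample-weighted MSE of per-child mean predictions.
    n = len(y_left) + len(y_right)
    return (len(y_left) * y_left.var() + len(y_right) * y_right.var()) / n

def r2(y_left, y_right):
    # R^2 of the piecewise-constant prediction induced by the split.
    pred = np.concatenate([np.full(len(y_left), y_left.mean()),
                           np.full(len(y_right), y_right.mean())])
    y_all = np.concatenate([y_left, y_right])
    sse = np.sum((y_all - pred) ** 2)
    sst = np.sum((y_all - y_all.mean()) ** 2)
    return 1 - sse / sst

splits = range(1, len(x))
best_by_mse = min(splits, key=lambda i: weighted_mse(y[:i], y[i:]))
best_by_r2 = max(splits, key=lambda i: r2(y[:i], y[i:]))
print(best_by_mse == best_by_r2)  # True: same split wins under either score
```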
Does this generalize to any loss function? For example, I also want to implement Kendall's tau correlation coefficient, and a combination of R, tau, and RMSE. :)
Unfortunately (or maybe fortunately :)) no, the equivalence between maximizing variance reduction and minimizing MSE is just a special case. :)

Best,
Sebastian
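To see what a truly custom criterion would have to compute, the search a tree node performs can be mimicked in plain Python with an arbitrary loss(predictions, targets). As Andreas noted earlier in the thread, doing this in Python rather than Cython is far too slow for real use. The function below is a hypothetical sketch, not scikit-learn API; it pairs MAE with per-child median predictions as an example loss:

```python
import numpy as np

def best_split(x, y, loss):
    """Brute-force the threshold on a single feature that minimizes `loss`
    over the piecewise-constant prediction a split would induce.
    Illustrative only: far slower than sklearn.tree._criterion."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        # Each child predicts one constant; the median pairs naturally
        # with MAE (the mean would pair with MSE).
        pred = np.concatenate([np.full(i, np.median(left)),
                               np.full(len(x) - i, np.median(right))])
        score = loss(pred, np.concatenate([left, right]))
        if score < best[1]:
            best = ((x[i - 1] + x[i]) / 2, score)
    return best

mae = lambda pred, target: np.mean(np.abs(pred - target))

rng = np.random.default_rng(2)
x = rng.uniform(size=40)
y = np.where(x > 0.5, 2.0, -2.0) + rng.normal(scale=0.2, size=40)
threshold, score = best_split(x, y, mae)
print(round(threshold, 2))  # should land near the true break at 0.5
```

A non-separable loss such as Kendall's tau (or Pearson's R) would need the whole node's predictions and targets at once, which is exactly why it does not fit the incremental sufficient-statistics updates the Cython Criterion classes rely on.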
participants (5)
- Andreas Mueller
- Guillaume Lemaitre
- Guillaume Lemaître
- Sebastian Raschka
- Thomas Evangelidis