ANN Dirty_cat: learning on dirty categories
Hi scikit-learn friends,

As you might have seen on Twitter, my lab, with a few friends, has embarked on research to ease machine learning on "dirty data". We are experimenting with new encoding methods for non-curated string categories. For this, we are developing a small software project called "dirty_cat": https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas on "dirty categories". It is a research project, though we still try to do decent software engineering :). Rather than contributing to existing codebases (such as the great categorical-encoding project in scikit-learn-contrib), we spun it out as a separate software project to have the freedom to try out ideas that we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string categories, please give it a try. Understanding what works and what does not is important to know what to consolidate. Hopefully one day we can develop a tool that is of wide-enough interest that it can go in scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions of publicly available datasets that we could try it on, we would love to hear from you.

Cheers,
Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or an engineer, send me a line
I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614 :-/

Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help.

On 11/20/18 3:58 PM, Gael Varoquaux wrote:
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614
So would I. But there are several ways of doing it:
- the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast
- it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)
- it can be done using empirical-Bayes shrinkage, which is what we currently do in dirty_cat
We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid.
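To make the shrinkage idea concrete, here is a minimal sketch of an empirical-Bayes-style shrunk target encoding; the function name and the pseudo-count parameter `m` are illustrative, not dirty_cat's actual API:

```python
import numpy as np

def target_encode_shrunk(categories, y, m=10.0):
    """Encode each category by a shrunk mean of y: the category mean is
    pulled toward the global mean, with strength set by the pseudo-count m
    (a hypothetical smoothing parameter, not dirty_cat's real signature)."""
    global_mean = y.mean()
    encoding = {}
    for cat in np.unique(categories):
        mask = categories == cat
        n = mask.sum()
        # few samples -> stay close to the global mean; many -> category mean
        encoding[cat] = (y[mask].sum() + m * global_mean) / (n + m)
    return np.array([encoding[c] for c in categories])

cats = np.array(["a", "a", "a", "b"])
y = np.array([1.0, 1.0, 0.0, 1.0])
print(target_encode_shrunk(cats, y, m=2.0))  # [0.7 0.7 0.7 0.83333333]
```

The singleton category "b" is pulled strongly toward the global mean (0.75), while the better-populated "a" stays close to its own mean; this is what keeps the naive per-category average from overfitting rare categories.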
Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help.
We did (part of the results are in the publication), and we didn't have great success. Gaël
-- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
On 11/20/18 4:16 PM, Gael Varoquaux wrote:
- the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast
- it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)
This is called leave-one-out in the category_encoding library, I think, and that's what my first implementation would be.
- it can be done using empirical-Bayes shrinkage, which is what we currently do in dirty_cat.
Reference / explanation?
We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid.
aww ;)
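The leave-one-out variant mentioned above can be sketched as follows (a toy illustration with a hypothetical helper, not category_encoders' actual implementation):

```python
import numpy as np

def loo_target_encode(categories, y):
    """Leave-one-out target encoding at fit time: each training sample is
    encoded by the mean of y over the *other* samples of its category,
    so the sample's own target does not leak into its feature."""
    out = np.empty(len(y), dtype=float)
    global_mean = y.mean()
    for cat in np.unique(categories):
        mask = categories == cat
        n, s = mask.sum(), y[mask].sum()
        if n == 1:
            out[mask] = global_mean  # singleton: fall back to the global mean
        else:
            out[mask] = (s - y[mask]) / (n - 1)
    return out

cats = np.array(["a", "a", "a", "b", "b"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
print(loo_target_encode(cats, y))  # [0.5 1.  0.5 1.  0. ]
```

At transform time on new data one would use the plain per-category mean; the leave-one-out trick only applies to the training rows.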
On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote:
- it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53) This is called leave-one-out in the category_encoding library, I think, and that's what my first implementation would be.
- it can be done using empirical-Bayes shrinkage, which is what we currently do in dirty_cat. Reference / explanation?
I think that a good reference is the prior-art part of our paper: https://arxiv.org/abs/1806.00979. But we found the following reference helpful: Micci-Barreca, D., "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems", ACM SIGKDD Explorations Newsletter 3(1), 27–32 (2001).
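For reference, the smoothing in Micci-Barreca (2001) blends the per-category target mean with the global prior using a sigmoid weight that grows with the category count; the sketch below is an illustrative reading of that scheme, with `k` and `f` standing in for the paper's tunable offset and slope:

```python
import math

def micci_barreca_encode(n, cat_mean, prior, k=1.0, f=1.0):
    """Blend the category mean with the global prior: lambda(n) is a
    sigmoid in the category count n, so large categories keep their own
    mean while rare ones shrink toward the prior."""
    lam = 1.0 / (1.0 + math.exp(-(n - k) / f))
    return lam * cat_mean + (1.0 - lam) * prior

# a category seen once (n = 1, so lam = 0.5 with k = 1) lands halfway
# between its own mean and the prior:
print(micci_barreca_encode(n=1, cat_mean=1.0, prior=0.2))  # 0.6
```

This is the same shrinkage intuition as the empirical-Bayes approach, just with a different (sigmoid) weighting function.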
We are planning to do heavy benchmarking of those strategies, to figure out tradeoff. But we won't get to it before February, I am afraid. aww ;)
Yeah. I do slow science. Slow everything, actually :(. Gaël
On 11/20/18 4:43 PM, Gael Varoquaux wrote:
We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid.
Does that mean you'd be opposed to adding the leave-one-out TargetEncoder before you do this? I would really like to add it before February and it's pretty established.
On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote:
On 11/20/18 4:43 PM, Gael Varoquaux wrote:
We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid.
Does that mean you'd be opposed to adding the leave-one-out TargetEncoder
I'd rather not. Or rather, I'd rather have some benchmarks on it (it doesn't have to be us that does it).
I would really like to add it before February
A few months to get it right is not that bad, is it?
and it's pretty established.
Are there good references studying it? If there is a clear track record of study, it falls under the usual rules, and it should go in. Gaël
On 11/21/18 12:38 AM, Gael Varoquaux wrote:
We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid.
Does that mean you'd be opposed to adding the leave-one-out TargetEncoder
I'd rather not. Or rather, I'd rather have some benchmarks on it (it doesn't have to be us that does it).
I would really like to add it before February
A few months to get it right is not that bad, is it?
The PR is over a year old already, and you hadn't voiced any opposition there.
On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote:
The PR is over a year old already, and you hadn't voiced any opposition there.
My bad, sorry. Given the name, I had not guessed the link between the PR and the encoding of categorical features.

I find myself very much in agreement with the original issue and its discussion: https://github.com/scikit-learn/scikit-learn/issues/5853 : concerns about the name, and the importance of at least considering prior smoothing. I do not see these reflected in the PR. In general, the fact that there is not much literature on this implies that we should be benchmarking our choices.

The more I understand Kaggle, the less I think that we can fully use it as an inclusion argument: people do transforms that end up being very specific to one challenge. On the specific problem of categorical encoding, we have tried to do a systematic analysis of some of these, and were not very successful empirically (e.g. hashing encoding). This is not at all a vote against target encoding, which our benchmarks showed was very useful, but a push for benchmarking PRs, in particular when they do not correspond to well-cited work (which is our standard inclusion criterion).

Joris has just accepted to help with benchmarking. We can have preliminary results earlier. The question really is: out of the different variants that exist, which one should we choose? I think that it is a legitimate question that arises on many of our PRs. But in general, I don't think that we should rush things because of deadlines. The consequence of a rush is that we need to change things after merge, which is more work. I know that it is slow, but we are quite a central package. Gaël
On 11/21/18 10:34 AM, Gael Varoquaux wrote:
Joris has just accepted to help with benchmarking. We can have preliminary results earlier. The question really is: out of the different variants that exist, which one should we choose. I think that it is a legitimate question that arises on many of our PRs.
Thanks Joris! I could also ask Jan to help ;) The question for this particular issue, for me, is also "what are good benchmark datasets". It's a somewhat different task than what you're benchmarking with dirty_cat, right? In dirty_cat you used dirty categories, which are a subset of all high-cardinality categorical variables. Whether "clean" high-cardinality variables like zip codes or dirty ones are the better benchmark is a bit unclear to me, and I'm not aware of a wealth of datasets for either :-/
But in general, I don't think that we should rush things because of deadlines. Consequences of a rush are that we need to change things after merge, which is more work. I know that it is slow, but we are quite a central package.
I agree.
On Wed, Nov 21, 2018 at 11:35:11AM -0500, Andreas Mueller wrote:
The question for this particular issue for me is also "what are good benchmark datasets". In dirty cat you used dirty categories, which is a subset of all high-cardinality categorical variables. Whether "clean" high cardinality variables like zip-codes or dirty ones are the better benchmark is a bit unclear to me, and I'm not aware of a wealth of datasets for either :-/
Fair point. We'll have a look to see what we can find. We're open to suggestions, from you or from anyone else. G
Maybe a subset of the Criteo TB dataset?
Hi all,

I finally had some time to start looking at it these last days. Some preliminary work can be found here: https://github.com/jorisvandenbossche/target-encoder-benchmarks

Up to now, I only did some preliminary work to set up the benchmarks (based on Patricio Cerda's code, https://arxiv.org/pdf/1806.00979.pdf), and with some initial datasets (medical charges and employee salaries) compared the different implementations with their default settings. So there is still a lot to do (add datasets, investigate the actual differences between the different implementations and results, compare the options in a more structured way, etc.; there are some TODOs listed in the README). However, I am now mostly on holidays for the rest of December. If somebody wants to look at it further, that is certainly welcome; otherwise, it will be a priority for me at the beginning of January.

For datasets: additional ideas are welcome. For now, the idea is to add a subset of the Criteo Terabyte Click dataset, and to generate some data.
Does that mean you'd be opposed to adding the leave-one-out TargetEncoder
I would really like to add it before February
A few months to get it right is not that bad, is it?
The PR is over a year old already, and you hadn't voiced any opposition there.
As far as I understand, the open PR is not a leave-one-out TargetEncoder?

I also did not yet add the CountFeaturizer from that scikit-learn PR, because it is actually quite different (e.g. it doesn't work for regression tasks, as it counts conditional on y). But for classification it could easily be added to the benchmarks. Joris
On 12/13/18 4:16 AM, Joris Van den Bossche wrote:
I finally had some time to start looking at it the last days. Some preliminary work can be found here: https://github.com/jorisvandenbossche/target-encoder-benchmarks
You continue to be my hero. Probably cannot look at it in detail before the holidays though :-/
As far as I understand, the open PR is not a leave-one-out TargetEncoder?
I would want it to be :-/
I also did not yet add the CountFeaturizer from that scikit-learn PR, because it is actually quite different (e.g. it doesn't work for regression tasks, as it counts conditional on y). But for classification it could be easily added to the benchmarks.
I'm confused now. That's what TargetEncoder and leave-one-out TargetEncoder do as well, right?
On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3kcit@gmail.com> wrote:
As far as I understand, the open PR is not a leave-one-out TargetEncoder?
I would want it to be :-/
I also did not yet add the CountFeaturizer from that scikit-learn PR, because it is actually quite different (e.g it doesn't work for regression tasks, as it counts conditional on y). But for classification it could be easily added to the benchmarks.
I'm confused now. That's what TargetEncoder and leave-one-out TargetEncoder do as well, right?
As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target given the category. For binary classification this indeed comes down to counting the 0's and 1's, and there the information contained in the result might be similar to the sklearn PR, but the format is different: those packages calculate the probability (a value between 0 and 1, the number of 1's divided by the number of samples in that category) and return that as a single column, instead of returning two columns with the counts of the 0's and 1's.

And for regression this is not related to counting anymore, but is just the average of the target per category (in practice, the TargetEncoder computes the same thing for regression and binary classification: the average of the target per category. But for regression, the CountFeaturizer doesn't work, since there are no discrete values in the target to count).

Furthermore, all of the implementations in the 3 mentioned packages have some kind of regularization (empirical-Bayes shrinkage, or KFold or leave-one-out cross-validation), while this is not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks).

Another thing I noticed in the CountFeaturizer implementation is that the behaviour differs depending on whether y is passed or not. First, I find it a bit strange to do this, as the two behaviours are quite different (counting the categories, to encode the categorical variable with a notion of its frequency in the training set, versus counting the target depending on the category). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way, you always get the behaviour of counting the target.
I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). (I need to give this feedback on the PR, but that will be for after the holidays) Joris
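To illustrate the output-format difference Joris describes, here is a toy sketch on binary data (the helper names are hypothetical, not the actual code of either package):

```python
import numpy as np

def target_encoder_columns(cats, y):
    """TargetEncoder-style output: a single column holding the mean of y
    per category (for binary y, this is P(y=1 | category))."""
    means = {c: float(y[cats == c].mean()) for c in np.unique(cats)}
    return [means[c] for c in cats]

def count_featurizer_columns(cats, y, classes=(0, 1)):
    """CountFeaturizer-style output: one column per class, holding the
    count of each y value within the category."""
    counts = {c: [int((y[cats == c] == k).sum()) for k in classes]
              for c in np.unique(cats)}
    return [counts[c] for c in cats]

cats = np.array(["a", "a", "b", "b"])
y = np.array([0, 1, 1, 1])
print(target_encoder_columns(cats, y))    # [0.5, 0.5, 1.0, 1.0]
print(count_featurizer_columns(cats, y))  # [[1, 1], [1, 1], [0, 2], [0, 2]]
```

The single probability column carries the same information as the two count columns up to the category sizes, but only the mean-based form generalizes to regression targets.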
On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target depending on the category. For binary classification this indeed comes to counting the 0's and 1's, and there the information contained in the result might be similar to the sklearn PR, but the format is different: those packages calculate the probability (a value between 0 and 1, the number of 1's divided by the number of samples in that category) and return that as a single column, instead of returning two columns with the counts for the 0's and 1's.
This is a standard case of the "binary special case", right? For multi-class you need multiple columns, right? Doing a single column for binary makes sense, I think.
And for regression this is not related to counting anymore, but just the average of the target per category (in practice, the TargetEncoder is computing the same for regression or binary classification: the average of the target per category. But for regression, the CountFeaturizer doesn't work since there are no discrete values in the target to count).
I guess CountFeaturizer was not implemented with regression in mind. Actually, being able to do regression and classification in the same estimator shows that "CountFeaturizer" is probably the wrong name.
Furthermore, all of those implementations in the 3 mentioned packages have some kind of regularization (empirical bayes shrinkage, or KFold or leave-one-out cross-validation), while this is also not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks).
Another thing I noticed in the CountFeaturizer implementation, is that the behaviour differs when y is passed or not. First, I find it a bit strange to do this as it is a quite different behaviour (counting the categories (to just encode the categorical variable with a notion about its frequency in the training set), or counting the target depending on the category is quite different?). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way, you always have the behaviour of counting the target. I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). (I need to give this feedback on the PR, but that will be for after the holidays)
I'm pretty sure I mentioned that before, I think optional y is bad. I just thought it was weird but the pipeline argument is a good one.
participants (4)
- Andreas Mueller
- Gael Varoquaux
- Joris Van den Bossche
- Olivier Grisel