[scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

Andreas Mueller t3kcit at gmail.com
Wed Dec 19 17:31:23 EST 2018



On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
> On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3kcit at gmail.com> wrote:
>
>>     As far as I understand, the open PR is not a leave-one-out
>>     TargetEncoder?
>     I would want it to be :-/
>>     I also did not yet add the CountFeaturizer from that scikit-learn
>>     PR, because it is actually quite different (e.g. it doesn't work
>>     for regression tasks, as it counts conditional on y). But for
>>     classification it could be easily added to the benchmarks.
>     I'm confused now. That's what TargetEncoder and leave-one-out
>     TargetEncoder do as well, right?
>
>
> As far as I understand, that is not exactly what those do. The
> TargetEncoder (as implemented in dirty_cat, category_encoders and
> hccEncoders) will, for each category, calculate the expected value of
> the target conditional on the category. For binary classification this
> indeed comes down to counting the 0's and 1's, and there the information
> contained in the result may be similar to that of the sklearn PR, but the
> format is different: those packages calculate the probability (a value
> between 0 and 1: the number of 1's divided by the number of samples in
> that category) and return that as a single column, instead of returning
> two columns with the counts of the 0's and 1's.
This is the standard "binary special case", right? For
multi-class you need multiple columns, right?
Doing a single column for binary makes sense, I think.
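
Concretely, the two output formats look something like this (a toy
pandas sketch, not the actual code from either implementation):

    import pandas as pd

    # Toy binary-classification data: one categorical feature, binary target.
    df = pd.DataFrame({"cat": ["a", "a", "b", "b", "b"],
                       "y":   [1,   0,   1,   1,   0]})

    # TargetEncoder-style output: a single column with the per-category
    # mean of the target, i.e. the empirical P(y=1 | category).
    target_enc = df.groupby("cat")["y"].mean()
    print(df["cat"].map(target_enc))

    # CountFeaturizer-style output: one column per class, holding the
    # count of that target value within the category.
    counts = pd.crosstab(df["cat"], df["y"])
    print(counts.loc[df["cat"]].reset_index(drop=True))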

> And for regression this is not related to counting anymore, but is just
> the average of the target per category. (In practice, the TargetEncoder
> computes the same thing for regression and binary classification: the
> average of the target per category. But for regression, the
> CountFeaturizer doesn't work, since there are no discrete values in the
> target to count.)
I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to do regression and classification in the same
estimator shows that "CountFeaturizer" is probably the wrong name.
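
The same one-line computation covers both tasks; a self-contained toy
example with a continuous target (hypothetical data, just for
illustration):

    import pandas as pd

    # Regression: the encoding is still just the per-category target mean.
    df_reg = pd.DataFrame({"cat": ["a", "a", "b", "b"],
                           "y":   [3.2, 4.8, 1.0, 2.0]})
    encoding = df_reg.groupby("cat")["y"].mean()  # a -> 4.0, b -> 1.5
    print(df_reg["cat"].map(encoding))

    # There are no discrete target values to count here, but the
    # encoding logic is unchanged.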

>
> Furthermore, all of the implementations in the three packages mentioned
> have some kind of regularization (empirical Bayes shrinkage, or KFold
> or leave-one-out cross-validation), which is not present in the
> CountFeaturizer PR (but this aspect is of course something we want to
> actually test in the benchmarks).
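
Right. For reference, a rough sketch of what those two flavours of
regularization look like (toy code, not the actual implementation in
any of the three packages):

    import pandas as pd

    df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                       "y":   [1,   0,   1,   1,   0]})
    grp = df.groupby("cat")["y"]

    # Leave-one-out: encode each row with the target mean of the *other*
    # rows in its category, so a row's own target never leaks into its
    # own encoding (singleton categories need special handling).
    sums = grp.transform("sum")
    counts = grp.transform("count")
    loo = (sums - df["y"]) / (counts - 1)

    # Empirical-Bayes-style shrinkage: pull the category mean toward the
    # global mean, with a weight that grows with the category size.
    k = 2.0  # hypothetical smoothing strength
    cat_mean = grp.transform("mean")
    shrunk = (counts * cat_mean + k * df["y"].mean()) / (counts + k)
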
>
> Another thing I noticed in the CountFeaturizer implementation is that
> the behaviour differs depending on whether y is passed or not. First, I
> find it a bit strange to do this, as the two behaviours are quite
> different: counting the categories (to encode the categorical variable
> with a notion of its frequency in the training set) is not the same as
> counting the target conditional on the category. But also, when using a
> transformer in a Pipeline, you don't control the passing of y, I
> think? So in that way, you always get the behaviour of counting the
> target.
> I would find it more logical to have those two things in two separate
> transformers (if we think the "frequency encoder" is useful enough).
> (I need to give this feedback on the PR, but that will be for after
> the holidays.)
>
I'm pretty sure I mentioned that before; I think an optional y is bad. I
just thought it was weird, but the Pipeline argument is a good one.
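
To make that point concrete: Pipeline.fit always forwards y to each
step's fit, so an optional-y branch can never take the no-y path inside
a pipeline. A minimal sketch (the YSensitive transformer is
hypothetical, just for illustration):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    class YSensitive(BaseEstimator, TransformerMixin):
        """Toy transformer whose fit records whether it received y."""
        def fit(self, X, y=None):
            self.saw_y_ = y is not None
            return self

        def transform(self, X):
            return X

    X = np.arange(6).reshape(3, 2)
    y = np.array([0, 1, 0])
    pipe = make_pipeline(YSensitive(), LogisticRegression())
    pipe.fit(X, y)
    print(pipe.steps[0][1].saw_y_)  # True: y is always passed to fit

So the frequency-counting behaviour would be unreachable in the common
Pipeline use case, which is another argument for splitting the two
behaviours into separate transformers.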