[scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories
Andreas Mueller
t3kcit at gmail.com
Wed Dec 19 17:31:23 EST 2018
On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
> On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3kcit at gmail.com>
> wrote:
>
>> As far as I understand, the open PR is not a leave-one-out
>> TargetEncoder?
> I would want it to be :-/
>> I also did not yet add the CountFeaturizer from that scikit-learn
>> PR, because it is actually quite different (e.g it doesn't work
>> for regression tasks, as it counts conditional on y). But for
>> classification it could be easily added to the benchmarks.
> I'm confused now. That's what TargetEncoder and leave-one-out
> TargetEncoder do as well, right?
>
>
> As far as I understand, that is not exactly what those do. The
> TargetEncoder (as implemented in dirty_cat, category_encoders and
> hccEncoders) will, for each category, calculate the expected value of
> the target conditional on that category. For binary classification
> this indeed comes down to counting the 0's and 1's, so the information
> contained in the result may be similar to the sklearn PR, but the
> format is different: those packages compute the probability (a value
> between 0 and 1: the number of 1's divided by the number of samples in
> that category) and return it as a single column, instead of returning
> two columns with the counts of the 0's and 1's.
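To make that format difference concrete, here is a minimal sketch in
plain pandas (made-up data, not the actual PR code):

    import pandas as pd

    X = pd.DataFrame({"city": ["a", "a", "b", "b", "b"]})
    y = pd.Series([1, 0, 1, 1, 0])

    # TargetEncoder-style: a single column with P(y=1 | category),
    # i.e. the mean of the binary target per category
    target_encoded = X["city"].map(y.groupby(X["city"]).mean())

    # CountFeaturizer-style: one column per class with the raw counts
    # of 0's and 1's per category
    counts = pd.crosstab(X["city"], y)
    count_encoded = counts.loc[X["city"]].reset_index(drop=True)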
This is the standard "binary special case", right? For multi-class you
need multiple columns. Doing a single column for binary makes sense, I
think.
> And for regression this is no longer related to counting, but is just
> the average of the target per category (in practice, the TargetEncoder
> computes the same thing for regression and binary classification: the
> average of the target per category). The CountFeaturizer, however,
> doesn't work for regression, since there are no discrete values in the
> target to count.
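In the regression case the same sketch just becomes the per-category
mean of a continuous target (again made-up data):

    import pandas as pd

    X = pd.DataFrame({"city": ["a", "a", "b", "b"]})
    y = pd.Series([10.0, 14.0, 3.0, 5.0])

    # average of the target per category: a -> 12.0, b -> 4.0
    encoded = X["city"].map(y.groupby(X["city"]).mean())
    # there are no discrete target values to count, so a count-based
    # encoding has no analogue here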
I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to handle regression and classification in the
same estimator shows that "CountFeaturizer" is probably the wrong
name.
>
> Furthermore, all of the implementations in the 3 mentioned packages
> have some kind of regularization (empirical Bayes shrinkage, or KFold
> or leave-one-out cross-validation), while this is not present in the
> CountFeaturizer PR (though this aspect is of course something we want
> to actually test in the benchmarks).
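For reference, here is roughly what two of those regularizations look
like; this is a sketch of the general technique, not any package's
exact implementation:

    import pandas as pd

    X = pd.DataFrame({"city": ["a", "a", "a", "b", "b"]})
    y = pd.Series([1.0, 0.0, 1.0, 1.0, 0.0])

    grp = y.groupby(X["city"])
    sums = X["city"].map(grp.sum())
    counts = X["city"].map(grp.count())

    # leave-one-out: encode each row with its category's target mean
    # computed *excluding that row*, to limit target leakage
    # (NaN for singleton categories; a real implementation would fall
    # back to the global mean there)
    loo = (sums - y) / (counts - 1)

    # empirical-Bayes-style shrinkage toward the global mean; m is a
    # smoothing strength picked here arbitrarily
    m = 10.0
    shrunk = (sums + m * y.mean()) / (counts + m)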
>
> Another thing I noticed in the CountFeaturizer implementation is that
> the behaviour differs depending on whether y is passed or not. First,
> I find this a bit strange, because the two behaviours are quite
> different: counting the categories (to encode the categorical variable
> with a notion of its frequency in the training set) versus counting
> the target conditional on the category. But also, when using a
> transformer in a Pipeline, you don't control the passing of y, I
> think? So in that case you always get the behaviour of counting the
> target.
> I would find it more logical to have those two things in two separate
> transformers (if we think the "frequency encoder" is useful enough).
> (I need to give this feedback on the PR, but that will be for after
> the holidays)
>
I'm pretty sure I mentioned that before: I think optional y is bad. I
just thought it was weird, but the pipeline argument is a good one.
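To illustrate the pipeline point with a minimal sketch (toy
transformer, not the PR code): Pipeline forwards y to every step's
fit, so a transformer that branches on whether y was passed will
always take the y-dependent branch inside a pipeline:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    class OptionalYTransformer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # record whether y was forwarded to us
            self.saw_y_ = y is not None
            return self

        def transform(self, X):
            return X

    X = np.arange(8, dtype=float).reshape(4, 2)
    y = np.array([0, 1, 0, 1])

    pipe = Pipeline([("enc", OptionalYTransformer()),
                     ("clf", LogisticRegression())])
    pipe.fit(X, y)
    print(pipe.named_steps["enc"].saw_y_)  # True: y was passed along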