[scikit-learn] ANN Dirty_cat: learning on dirty categories

Andreas Mueller t3kcit at gmail.com
Tue Nov 20 16:35:43 EST 2018



On 11/20/18 4:16 PM, Gael Varoquaux wrote:
> - the naive way is not the right one: just computing the average of y
>    for each category leads to overfitting quite fast
>
> - it can be done cross-validated, splitting the train data, in a
>    "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53)
This is called leave-one-out in the category_encoders library, I think,
and that's what my first implementation would be.
>
> - it can be done using empirical-Bayes shrinkage, which is what we
>    currently do in dirty_cat.
Reference / explanation?
>
> We are planning to do heavy benchmarking of those strategies, to figure
> out the tradeoffs. But we won't get to it before February, I am afraid.
aww ;)
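
For concreteness, here is a rough sketch of the three strategies discussed above, in plain Python. It is only an illustration under stated assumptions: the function names and the smoothing strength `m` are hypothetical, leave-one-out is shown as the extreme case of the cross-fit idea, and the shrinkage is a simple additive smoothing toward the global mean, which is not necessarily what dirty_cat or category_encoders actually implement.

```python
from collections import defaultdict


def _group_stats(cats, y):
    """Return per-category sums and counts of the target."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, y):
        sums[c] += t
        counts[c] += 1
    return sums, counts


def naive_target_encode(cats, y):
    """Naive version: mean of y per category, using every row.

    Each row's own target leaks into its encoding, which is why this
    tends to overfit.
    """
    sums, counts = _group_stats(cats, y)
    return [sums[c] / counts[c] for c in cats]


def leave_one_out_encode(cats, y):
    """Mean of y per category, excluding the current row's own target."""
    sums, counts = _group_stats(cats, y)
    prior = sum(y) / len(y)
    out = []
    for c, t in zip(cats, y):
        n = counts[c]
        # A category seen only once has no other rows to average over,
        # so fall back to the global mean (an assumption of this sketch).
        out.append((sums[c] - t) / (n - 1) if n > 1 else prior)
    return out


def shrunk_encode(cats, y, m=1.0):
    """Per-category mean shrunk toward the global mean.

    `m` is a hypothetical smoothing strength: it acts like m pseudo-rows
    carrying the global mean, so rare categories are pulled harder.
    """
    sums, counts = _group_stats(cats, y)
    prior = sum(y) / len(y)
    return [(sums[c] + m * prior) / (counts[c] + m) for c in cats]


cats = ["a", "a", "b"]
y = [1.0, 0.0, 1.0]
naive = naive_target_encode(cats, y)
loo = leave_one_out_encode(cats, y)
shrunk = shrunk_encode(cats, y, m=1.0)
```

On this toy data the naive encoding gives the singleton category "b" exactly its own target, while the other two variants pull it toward the global mean.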

