[scikit-learn] ANN Dirty_cat: learning on dirty categories

Gael Varoquaux gael.varoquaux at normalesup.org
Wed Nov 21 10:34:24 EST 2018


On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote:
> The PR is over a year old already, and you hadn't voiced any opposition
> there.

My bad, sorry. Given the name, I had not guessed the link between the PR
and the encoding of categorical features. I find myself very much in
agreement with the original issue and its discussion
(https://github.com/scikit-learn/scikit-learn/issues/5853): concerns about
the name, and about the importance of at least considering prior
smoothing. I do not see these reflected in the PR.


In general, the fact that there is not much literature on this implies
that we should be benchmarking our choices. The more I understand Kaggle,
the less I think we can fully use it as an inclusion argument: people do
transforms that end up being very specific to one challenge. On the
specific problem of categorical encoding, we have tried to do a systematic
analysis of some of these and were not very successful empirically (e.g.
hashing encoding). This is not at all a vote against target encoding,
which our benchmarks showed to be very useful, but rather a push for
benchmarking PRs, in particular when they do not correspond to well-cited
work (which is our standard inclusion criterion).
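For reference, the kind of benchmark I mean is simply comparing encoder
variants with cross-validation on the same downstream estimator. The data
and the list of encoders below are placeholders, not our actual benchmark
setup.

    # Rough sketch of benchmarking encoder variants: same estimator, same
    # cross-validation, only the categorical encoder changes. Placeholder
    # data and encoders, for illustration only.
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    rng = np.random.RandomState(0)
    X = rng.choice(list("abcdefgh"), size=(500, 2))   # two categorical columns
    y = (X[:, 0] < "d").astype(int) ^ rng.binomial(1, 0.1, 500)  # noisy target

    encoders = {
        "one-hot": OneHotEncoder(handle_unknown="ignore"),
        "ordinal": OrdinalEncoder(),
    }
    for name, enc in encoders.items():
        pipe = make_pipeline(
            ColumnTransformer([("enc", enc, [0, 1])]),
            LogisticRegression(max_iter=1000),
        )
        scores = cross_val_score(pipe, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")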

Joris has just agreed to help with the benchmarking, so we can have
preliminary results fairly soon. The question really is: out of the
different variants that exist, which one should we choose? I think that
this is a legitimate question that arises on many of our PRs.

But in general, I don't think that we should rush things because of
deadlines. The consequence of rushing is that we have to change things
after the merge, which is more work. I know that it is slow, but we are
quite a central package.

Gaël

