[scikit-learn] Categorical Encoding of high cardinality variables

federico vaggi vaggi.federico at gmail.com
Fri Apr 19 12:52:51 EDT 2019


Hi everyone,

I wanted to use the scikit-learn transformer API to clean up some messy
data as input to a neural network.  One of the steps involves converting
categorical variables (of very high cardinality) into integers for use in
an embedding layer.
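For reference, this is roughly the baseline I am starting from (the
category names below are just placeholders): LabelEncoder gives contiguous
integer codes, which is exactly what an embedding layer wants, but it
raises on anything it did not see during fit.

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(["cat_a", "cat_b", "cat_a"])
    le.transform(["cat_a", "cat_b"])  # array([0, 1])
    le.transform(["cat_z"])           # ValueError: y contains previously unseen labels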

Unfortunately, I cannot quite use LabelEncoder to solve this.  When
dealing with categorical variables of very high cardinality, I have found
it useful in practice to set a frequency threshold below which a category
is mapped to an 'unk' or 'rare' label.  The same label would also be
applied at test time to categories that were not observed in the training
set.

This is relatively straightforward to add to the existing LabelEncoder
code, but it breaks the contract slightly: once several categories are
collapsed into a single 'rare' label, the transform operation is no longer
a bijection, so inverse_transform cannot recover the original values.
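For concreteness, here is a minimal sketch of the kind of transformer I
have in mind.  The class name, the min_freq parameter and the choice of
reserving index 0 for the pooled bucket are all just mine, not anything in
scikit-learn:

    # Hypothetical sketch, not part of scikit-learn: pools categories seen
    # fewer than `min_freq` times during fit(), plus anything unseen at
    # transform() time, into a single shared index.
    from collections import Counter

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin


    class RareThresholdEncoder(BaseEstimator, TransformerMixin):
        def __init__(self, min_freq=10, rare_token="rare"):
            self.min_freq = min_freq
            self.rare_token = rare_token

        def fit(self, y):
            counts = Counter(y)
            # Keep only categories that clear the frequency threshold.
            frequent = sorted(c for c, n in counts.items() if n >= self.min_freq)
            # Index 0 is reserved for the pooled 'rare' / unseen bucket.
            self.mapping_ = {c: i + 1 for i, c in enumerate(frequent)}
            self.classes_ = [self.rare_token] + frequent
            return self

        def transform(self, y):
            # Below-threshold and unseen categories all collapse to index 0,
            # which is exactly where the bijection is lost.
            return np.array([self.mapping_.get(v, 0) for v in y])

    enc = RareThresholdEncoder(min_freq=2)
    enc.fit_transform(["a", "a", "b", "a", "c"])  # array([1, 1, 0, 1, 0])
    enc.transform(["a", "z"])                     # array([1, 0]); unseen 'z' is pooled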

Is this feature too niche for mainline scikit-learn?  I saw that a package
implementing a similar idea (
https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html)
was discussed on this mailing list.