[scikit-learn] Categorical Encoding of high cardinality variables

Sole Galli solegalli1 at gmail.com
Tue Apr 23 20:00:15 EDT 2019


Hello everyone,

I am Sole; I started the conversation about Feature-engine
<https://feature-engine.readthedocs.io>, a package I created for feature
engineering.

Regarding the grouping of *rare / infrequent* categories under an umbrella
term like "Rare" or "Other", which Federico raised recently, I would like to
point to some literature (quoted below) that documents the use of this
procedure. It comes from the series of articles by the teams with the best
solutions to the 2009 KDD Cup, compiled into one "book
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>", which I am sure
you are aware of already. I would also like to highlight that this is
extremely common practice in industry, not only to avoid overfitting, but
also to handle categories that were not seen during training once models are
deployed. It would be great to see this functionality added to both the
OrdinalEncoder and the OneHotEncoder, triggered by how well represented a
category is in the dataset (e.g. a minimum percentage of observations).
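
To make the idea concrete, below is a minimal sketch of such a grouper,
written as a scikit-learn style transformer. The class name RareLabelGrouper
and the tol parameter are mine, purely for illustration, and the sketch
assumes the input is a pandas DataFrame of categorical columns:

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class RareLabelGrouper(BaseEstimator, TransformerMixin):
        """Group categories seen in less than `tol` of the training rows,
        as well as categories never seen during fit, under the label
        'Rare'."""

        def __init__(self, tol=0.05):
            self.tol = tol

        def fit(self, X, y=None):
            # Per column, remember which categories are frequent enough
            # to keep.
            self.frequent_ = {
                col: X[col].value_counts(normalize=True)
                           .loc[lambda s: s >= self.tol].index
                for col in X.columns
            }
            return self

        def transform(self, X):
            X = X.copy()
            for col, keep in self.frequent_.items():
                # Anything outside the frequent set, including categories
                # unseen during fit, falls into the umbrella label.
                X[col] = X[col].where(X[col].isin(keep), 'Rare')
            return X

A transformer like this can sit in a Pipeline in front of OneHotEncoder or
OrdinalEncoder, so the grouping learned on the train set is applied
consistently at test time, including to previously unseen categories.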

Here are the main quotes from these articles
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>:

Page 4 of the summary and introductory article:
"For categorical variables, grouping of under-represented categories proved
to be useful to avoid overfitting. The winners of the fast and the slow
track used similar strategies consisting in retaining the most populated
categories and coarsely grouping the others in an unsupervised way"

Page 23:
"Most of the learning algorithms we were planning to use do not handle
categorical variables, so we needed to recode them. This was done in a
standard way, by generating indicator variables for the different values a
categorical attribute could take. The only slightly non-standard decision
was to limit ourselves to encoding only the 10 most common values of each
categorical attribute, rather than all the values, in order to avoid an
explosion in the number of features from variables with a huge vocabulary"

Page 36:
"We consolidate the extremely low populated entries (having fewer than 200
examples) with their neighbors to smooth out the outliers. Similarly, we
group some categorical variables which have a large number of entries (>
1000 distinct values) into 100 categories."

See also the bullet points on page 47.

I hope you find these useful.

Let me know if / how I can help.

Regards

Sole

On Fri, 19 Apr 2019 at 17:54, federico vaggi <vaggi.federico at gmail.com>
wrote:

> Hi everyone,
>
> I wanted to use the scikit-learn transformer API to clean up some messy
> data as input to a neural network.  One of the steps involves converting
> categorical variables (of very high cardinality) into integers for use in
> an embedding layer.
>
> Unfortunately, I cannot quite use LabelEncoder to solve this.  When
> dealing with categorical variables with very high cardinality, I found it
> useful in practice to have a threshold value for the frequency under which
> a variable ends up with the 'unk' or 'rare' label.  This same label would
> also end up applied at test time to entries that were not observed in the
> train set.
>
> This is relatively straightforward to add to the existing label encoder
> code, but it breaks the contract slightly: if we encode several values with
> a single 'rare' label, then the transform operation is no longer a bijection.
>
> Is this feature too niche for the main sklearn?  I saw there was a package
> (
> https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html)
> that implemented a similar feature discussed in the mailing list.