That's a really good point. Do you know of any systematic studies about the two different encodings?

Finally: wasn't there a PR for RF to accept categorical variables as inputs?

On Wed, 29 Mar 2017 at 11:57, Olivier Grisel <olivier.grisel@ensta.org> wrote:

Integer coding will indeed make the DT assume an arbitrary ordering, while one-hot encoding does not force the tree model to make that assumption.
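For concreteness, a minimal sketch of the two encodings using scikit-learn's OrdinalEncoder and OneHotEncoder (requires scikit-learn >= 0.20; the toy data is invented for illustration):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    # A single categorical feature with three possible values.
    X = np.array([["red"], ["green"], ["blue"], ["green"]])

    # Integer coding: one column, categories mapped to arbitrary integers
    # (blue=0, green=1, red=2 here), which implicitly imposes an ordering.
    print(OrdinalEncoder().fit_transform(X))

    # One-hot encoding: one binary column per category, no ordering implied.
    print(OneHotEncoder().fit_transform(X).toarray())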
However, in practice, when the depth of the trees is not too limited (or if you use a large enough ensemble of trees), the model has enough flexibility to introduce as many splits as necessary to isolate individual categories of the integer-coded feature, and therefore the arbitrary ordering assumption is not a problem.
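To illustrate (a toy sketch; the data and the subset of "positive" categories are invented): an unconstrained tree can carve out any subset of integer codes with repeated threshold splits:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    codes = rng.randint(0, 10, size=(1000, 1))   # integer-coded category
    positive = {1, 4, 7}                         # arbitrary subset of categories
    y = np.isin(codes.ravel(), list(positive)).astype(int)

    # With no depth limit, the tree isolates each category with threshold
    # splits, so the arbitrary ordering of the codes does not hurt it.
    tree = DecisionTreeClassifier(random_state=0).fit(codes, y)
    print(tree.score(codes, y))  # 1.0: {1, 4, 7} is perfectly isolated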
On the other hand, using one-hot encoding can introduce a detrimental inductive bias on random forests: random forests use uniform random feature sampling when deciding which feature to split on (e.g. pick the best split out of 25% of the features selected at random).
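In scikit-learn, that sampling rate is controlled by the forest's max_features parameter, e.g.:

    from sklearn.ensemble import RandomForestClassifier

    # Evaluate a random 25% of the features at each split.
    rf = RandomForestClassifier(n_estimators=100, max_features=0.25)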
Let's consider the following example: assume you have a heterogeneously typed dataset with 99 numeric features and 1 categorical feature of cardinality 1000 (1000 possible values for that feature):
- with integer coding, the RF has one chance in 100 of picking each feature (categorical or numerical) as a candidate for the next split;
- with one-hot encoding, there are 99 + 1000 = 1099 columns, so each numerical feature has only about a 0.09% chance (1/1099) of being picked, while about 91% of the candidate splits (1000/1099) fall on a category of the unique categorical feature.
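The arithmetic behind those numbers, spelled out:

    n_numeric, cardinality = 99, 1000

    # Integer coding: 99 numeric + 1 categorical = 100 columns.
    print(1 / (n_numeric + 1))          # 0.01 -> 1% per feature

    # One-hot encoding: 99 numeric + 1000 binary = 1099 columns.
    n_cols = n_numeric + cardinality
    print(1 / n_cols)                   # ~0.0009 per numeric feature
    print(cardinality / n_cols)         # ~0.91 for the categorical dummies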
The inductive bias of one-hot encoding on RFs can therefore completely break the feature balancing. The feature encoding will also impact the inductive bias with respect to the importance of the depth of the trees, even when feature splits are selected fully deterministically.
Finally, one-hot encoding features with large categorical cardinalities will be much slower to train on than naive integer coding.
TL;DR: naive theoretical analysis based only on the ordering assumption can be misleading. The inductive biases of each feature encoding are more complex to evaluate. Use cross-validation to decide which is best for your problem. Don't ignore computational considerations (CPU and memory usage).
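For instance, a minimal sketch of such a cross-validated comparison (X, y, and cat_cols are placeholders for your DataFrame, target, and list of categorical column names):

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    encoders = [
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        OneHotEncoder(handle_unknown="ignore"),
    ]
    for encoder in encoders:
        model = make_pipeline(
            ColumnTransformer([("cat", encoder, cat_cols)],
                              remainder="passthrough"),
            RandomForestClassifier(n_estimators=100, random_state=0),
        )
        scores = cross_val_score(model, X, y, cv=5)
        print(type(encoder).__name__, scores.mean())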
--
Olivier