That's a really good point. Do you know of any systematic studies about the two different encodings?

Finally: wasn't there a PR for RF to accept categorical variables as inputs?

On Wed, 29 Mar 2017 at 11:57, Olivier Grisel <olivier.grisel@ensta.org> wrote:

Integer coding will indeed make the DT assume an arbitrary ordering, while one-hot encoding does not force the tree model to make that assumption.
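For concreteness, a minimal sketch of the two encodings using scikit-learn's OrdinalEncoder and OneHotEncoder (requires scikit-learn >= 0.20; the toy data is invented for illustration):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    # A single categorical feature with three possible values.
    X = np.array([["red"], ["green"], ["blue"], ["green"]])

    # Integer coding: one column, categories mapped to arbitrary integers
    # (blue=0, green=1, red=2 here), which implicitly imposes an ordering.
    print(OrdinalEncoder().fit_transform(X))

    # One-hot encoding: one binary column per category, no ordering implied.
    print(OneHotEncoder().fit_transform(X).toarray())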
However, in practice, when the depth of the trees is not too limited (or if you use a large enough ensemble of trees), the model has enough flexibility to introduce as many splits as necessary to isolate individual categories of the integer-coded feature, and therefore the arbitrary ordering assumption is not a problem.
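To illustrate (a toy sketch; the data and the subset of "positive" categories are invented): an unconstrained tree can carve out any subset of integer codes with repeated threshold splits:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    codes = rng.randint(0, 10, size=(1000, 1))   # integer-coded category
    positive = {1, 4, 7}                         # arbitrary subset of categories
    y = np.isin(codes.ravel(), list(positive)).astype(int)

    # With no depth limit, the tree isolates each category with threshold
    # splits, so the arbitrary ordering of the codes does not hurt it.
    tree = DecisionTreeClassifier(random_state=0).fit(codes, y)
    print(tree.score(codes, y))  # 1.0: {1, 4, 7} is perfectly isolated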
On the other hand, using one-hot encoding can introduce a detrimental inductive bias on random forests: random forests use uniform random feature sampling when deciding which feature to split on (e.g. pick the best split out of 25% of the features selected at random).
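In scikit-learn, that sampling rate is controlled by the forest's max_features parameter, e.g.:

    from sklearn.ensemble import RandomForestClassifier

    # Evaluate a random 25% of the features at each split.
    rf = RandomForestClassifier(n_estimators=100, max_features=0.25)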
Let's consider the following example: assume you have a heterogeneously typed dataset with 99 numeric features and 1 categorical feature of cardinality 1000 (1000 possible values for that feature):
- with integer coding, the RF has one chance in 100 of picking each feature (categorical or numerical) as a candidate for the next split;
- with one-hot encoding, there are 99 + 1000 = 1099 columns, so each numerical feature has only about a 0.09% chance (1/1099) of being picked, while about 91% of the candidate splits (1000/1099) fall on a category of the unique categorical feature.
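The arithmetic behind those numbers, spelled out:

    n_numeric, cardinality = 99, 1000

    # Integer coding: 99 numeric + 1 categorical = 100 columns.
    print(1 / (n_numeric + 1))          # 0.01 -> 1% per feature

    # One-hot encoding: 99 numeric + 1000 binary = 1099 columns.
    n_cols = n_numeric + cardinality
    print(1 / n_cols)                   # ~0.0009 per numeric feature
    print(cardinality / n_cols)         # ~0.91 for the categorical dummies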
The inductive bias of one-hot encoding on RFs can therefore completely break the feature balancing. The feature encoding will also impact the inductive bias with respect to the importance of the depth of the trees, even when feature splits are selected fully deterministically.
Finally, one-hot encoding features with large categorical cardinalities will be much slower to train on than naive integer coding.
TL;DR: naive theoretical analysis based only on the ordering assumption can be misleading. The inductive biases of each feature encoding are more complex to evaluate. Use cross-validation to decide which is best for your problem. Don't ignore computational considerations (CPU and memory usage).
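For instance, a minimal sketch of such a cross-validated comparison (X, y, and cat_cols are placeholders for your DataFrame, target, and list of categorical column names):

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    encoders = [
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        OneHotEncoder(handle_unknown="ignore"),
    ]
    for encoder in encoders:
        model = make_pipeline(
            ColumnTransformer([("cat", encoder, cat_cols)],
                              remainder="passthrough"),
            RandomForestClassifier(n_estimators=100, random_state=0),
        )
        scores = cross_val_score(model, X, y, cv=5)
        print(type(encoder).__name__, scores.mean())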
--
Olivier