[scikit-learn] decision trees

Olivier Grisel olivier.grisel at ensta.org
Wed Mar 29 05:56:38 EDT 2017


Integer coding will indeed make the DT assume an arbitrary ordering of
the categories, while one-hot encoding does not force the tree model to
make that assumption.

However, in practice, when the depth of the trees is not too limited
(or if you use a large enough ensemble of trees), the model has enough
flexibility to introduce as many splits as necessary to isolate
individual categories of the integer-coded feature, and therefore the
arbitrary ordering assumption is not a problem in itself.
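
Here is a minimal sketch of that point (toy data made up for
illustration): a single integer-coded categorical feature where only
one category is predictive, and a plain decision tree that isolates it
despite the arbitrary ordering of the codes:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one categorical feature with 10 categories, integer coded.
# Only category 7 maps to the positive class, so the integer order is
# meaningless, but two threshold splits (x <= 6.5 then x <= 7.5) are
# enough to carve that category out.
rng = np.random.RandomState(0)
x = rng.randint(0, 10, size=(1000, 1))
y = (x.ravel() == 7).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x, y)
print(tree.score(x, y))  # should reach 1.0 on this toy example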

On the other hand, using one-hot encoding can introduce a detrimental
inductive bias on random forests: a random forest uses uniform random
feature sampling when deciding which features are candidates for the
next split (e.g. pick the best split out of 25% of the features
selected at random).

Let's consider the following example: assume you have a
heterogeneously typed dataset with 99 numeric features and 1
categorical feature of cardinality 1000 (1000 possible values for that
feature):

- with integer coding (100 columns in total), the RF has a 1-in-100
chance of picking each feature (categorical or numerical) as a
candidate for the next split,
- with one-hot encoding (99 + 1000 = 1099 columns in total), the RF has
only a ~0.09% chance of picking each numerical feature, and a ~91%
chance of picking a candidate split on one of the binary columns
derived from the single categorical feature (see the quick check after
this list).
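
A quick back-of-the-envelope check of those numbers (pure arithmetic,
nothing scikit-learn specific):

n_numeric = 99
cardinality = 1000

# Integer coding: 99 numeric columns + 1 categorical column = 100.
print(1.0 / (n_numeric + 1))                    # each feature: 1%

# One-hot encoding: 99 numeric columns + 1000 binary columns = 1099.
print(1.0 / (n_numeric + cardinality))          # each numeric feature: ~0.09%
print(cardinality / (n_numeric + cardinality))  # any one-hot column: ~91%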

The inductive bias of one-hot encoding on RFs can therefore completely
break the feature balancing. The feature encoding will also impact the
inductive bias with respect to the importance of the depth of the
trees, even when feature splits are selected fully deterministically.

Finally, one-hot encoding features with large categorical
cardinalities will be much slower than using naive integer coding.
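
To illustrate the size blow-up (made-up data, pandas used just for
convenience):

import numpy as np
import pandas as pd

# One categorical column with 1000 distinct values, 100000 rows.
rng = np.random.RandomState(0)
s = pd.Series(rng.randint(0, 1000, size=100000))

codes, _ = pd.factorize(s)         # integer coding: a single column
dummies = pd.get_dummies(s)        # one-hot: 1000 dense columns
print(codes.shape, dummies.shape)  # (100000,) vs (100000, 1000)

Sparse output (as produced by OneHotEncoder by default) mitigates the
memory cost but not the candidate-sampling issue described above.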

TL;DR: naive theoretical analysis based only on the ordering
assumption can be misleading. The inductive biases of each feature
encoding are more complex to evaluate. Use cross-validation to decide
which encoding works best on your problem, and don't ignore
computational considerations (CPU and memory usage).
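
For instance, a minimal cross-validation sketch, assuming a recent
scikit-learn (OrdinalEncoder and ColumnTransformer are available from
0.20 on); the dataset and column names are made up:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up dataset: one string categorical column + two numeric columns.
rng = np.random.RandomState(0)
n = 500
X = pd.DataFrame({
    "color": rng.choice(list("abcdefgh"), size=n),
    "x0": rng.normal(size=n),
    "x1": rng.normal(size=n),
})
y = (X["color"].isin(["a", "b"]) ^ (X["x0"] > 0)).astype(int)

for encoder in (OrdinalEncoder(), OneHotEncoder(handle_unknown="ignore")):
    pre = ColumnTransformer([("cat", encoder, ["color"])],
                            remainder="passthrough")
    model = make_pipeline(pre, RandomForestClassifier(n_estimators=300,
                                                      random_state=0))
    scores = cross_val_score(model, X, y, cv=5)
    print(type(encoder).__name__, scores.mean())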

-- 
Olivier

