[scikit-learn] decision trees

Wed Mar 29 10:30:21 EDT 2017

I'd argue that's why we should implement conditional inference trees ;)

On 03/29/2017 05:56 AM, Olivier Grisel wrote:
> Integer coding will indeed make the DT assume an arbitrary ordering
> while one-hot encoding does not force the tree model to make that
> assumption.
>
> However in practice when the depth of the trees is not too limited (or
> if you use a large enough ensemble of trees), the model will have
> enough flexibility to introduce as many splits as necessary to isolate
> individual categories in the integer coding, and therefore the
> arbitrary ordering assumption is not a problem.
>
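
A minimal sketch of the point above, assuming a toy dataset of my own (10 integer-coded categories, label = parity of the code, so the integer ordering carries no signal). It only checks training accuracy, to show that an unconstrained tree can isolate each category while a depth-1 stump cannot:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature, integer-coded 0..9,
# 20 samples per category. The label depends on the category in a way
# that is unrelated to the integer ordering (parity of the code).
n_categories = 10
codes = np.repeat(np.arange(n_categories), 20)
X = codes.reshape(-1, 1)
y = codes % 2

# A stump gets a single threshold on the arbitrary ordering.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# An unconstrained tree can keep splitting until each category is isolated.
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

shallow_acc = shallow.score(X, y)
deep_acc = deep.score(X, y)
print(f"depth 1 training accuracy:   {shallow_acc:.2f}")
print(f"unlimited training accuracy: {deep_acc:.2f}")
```

This says nothing about generalization; it only illustrates that depth can compensate for the ordering assumption.
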
> On the other hand using one-hot encoding can introduce a detrimental
> inductive bias on random forests: random forest uses uniform random
> feature sampling when deciding which feature to split on (e.g. pick
> the best split out of 25% of the features selected at random).
>
> Let's consider the following example: assume you have a
> heterogeneously typed dataset with 99 numeric features and 1
> categorical feature with cardinality 1000 (1000 possible
> values for that feature):
>
> - the RF will have one chance in 100 to pick each feature (categorical
> or numerical) as a candidate for the next split when using integer
> coding,
> - the RF will have roughly a 0.09% chance (1 in 1099) of picking each
> numerical feature and roughly a 91% chance (1000 in 1099) of selecting
> a candidate split on a category of the single categorical feature
> when using one-hot encoding.
>
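
The arithmetic for this example can be worked out directly. A small sketch, assuming a single uniform draw over the columns of the encoded matrix (the variable names are mine): integer coding leaves 99 + 1 = 100 columns, while one-hot encoding expands the categorical feature into 1000 columns for 1099 in total.

```python
from fractions import Fraction

n_numeric = 99      # numeric features in the example
cardinality = 1000  # distinct values of the single categorical feature

# Integer coding: the categorical feature stays a single column.
n_int = n_numeric + 1
p_each_int = Fraction(1, n_int)

# One-hot encoding: the categorical feature becomes `cardinality` columns.
n_onehot = n_numeric + cardinality
p_numeric_each = Fraction(1, n_onehot)
p_categorical = Fraction(cardinality, n_onehot)

print(f"integer coding: {n_int} columns, "
      f"P(any given feature) = {float(p_each_int):.2%}")
print(f"one-hot: {n_onehot} columns, "
      f"P(a given numeric feature) = {float(p_numeric_each):.2%}, "
      f"P(some category column) = {float(p_categorical):.2%}")
```
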
> The inductive bias of one-hot encoding on RFs can therefore completely
> break the feature balancing. The feature encoding will also impact the
> inductive bias with respect to the importance of the depth of the trees,
> even when feature splits are selected fully deterministically.
>
> Finally, one-hot encoding features with large categorical cardinalities
> will be much slower than naive integer coding.
>
> TL;DNR: naive theoretical analysis based only on the ordering
> assumption can be misleading. Inductive biases of each feature
> encoding are more complex to evaluate. Use cross-validation to decide
> which is the best on your problem. Don't ignore computational
> considerations (CPU and memory usage).
>
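
For what it's worth, the "use cross-validation and watch the cost" advice could be sketched roughly as below. Everything about the data is an assumption of mine (synthetic, 5 numeric features plus one integer-coded categorical feature with 50 levels), so the printed numbers mean nothing beyond showing the mechanics:

```python
import time

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic data: the target depends on one numeric feature
# and on an arbitrary subset of the categories.
rng = np.random.default_rng(0)
n_samples, n_numeric, n_categories = 600, 5, 50
X_num = rng.normal(size=(n_samples, n_numeric))
cat = rng.integers(0, n_categories, size=n_samples)
informative = rng.random(n_categories) < 0.5
y = ((X_num[:, 0] > 0) ^ informative[cat]).astype(int)
X = np.column_stack([X_num, cat])  # last column holds the integer codes

# One-hot encode only the last column; numeric columns pass through.
one_hot = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [n_numeric])],
    remainder="passthrough",
)

models = {
    "integer coding": RandomForestClassifier(n_estimators=50, random_state=0),
    "one-hot encoding": make_pipeline(
        one_hot, RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.perf_counter() - start
    results[name] = (scores.mean(), elapsed)
    print(f"{name}: mean accuracy {scores.mean():.3f}, {elapsed:.2f}s")
```

On a real problem you would swap in your own data and estimator settings, and track memory use as well as wall-clock time.
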