[scikit-learn] decision trees

federico vaggi vaggi.federico at gmail.com
Wed Mar 29 06:50:51 EDT 2017


That's a really good point.  Do you know of any systematic studies about
the two different encodings?

Finally: wasn't there a PR for RF to accept categorical variables as inputs?

On Wed, 29 Mar 2017 at 11:57, Olivier Grisel <olivier.grisel at ensta.org>
wrote:

> Integer coding will indeed make the DT assume an arbitrary ordering
> while one-hot encoding does not force the tree model to make that
> assumption.
>
> However in practice when the depth of the trees is not too limited (or
> if you use a large enough ensemble of trees), the model will have
> enough flexibility to introduce as many splits as necessary to isolate
> individual categories in the integer and therefore the arbitrary
> ordering assumption is not a problem.
>
> On the other hand using one-hot encoding can introduce a detrimental
> inductive bias on random forests: random forest uses uniform random
> feature sampling when deciding which feature to split on (e.g. pick
> the best split out of 25% of the features selected at random).
>
> Let's consider the following example: assume you have an
> heterogeneously typed dataset with 99 numeric features and 1
> categorical feature with categorical cardinality 1000 (1000 possible
> values for that features):
>
> - the RF will have one chance in 100 to pick each feature (categorical
> or numerical) as a candidate for the next split when using integer
> coding,
> - the RF will have 0.1% chance of picking each numerical feature and
> 99% chance to select a candidate feature split on a category of the
> unique categorical feature when using one-hot encoding.
>
> The inductive bias of one-encoding on RFs can therefore completely
> break the feature balancing. The feature encoding will also impact the
> inductive bias with respect the importance of the depth of the trees,
> even when feature splits are selected fully deterministically.
>
> Finally one-hot encoding features with large categorical cardinalities
> will be much slower then when using naive integer coding.
>
> TL;DNR: naive theoretical analysis based only on the ordering
> assumption can be misleading. Inductive biases of each feature
> encoding are more complex to evaluate. Use cross-validation to decide
> which is the best on your problem. Don't ignore computational
> considerations (CPU and memory usage).
>
> --
> Olivier
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170329/5da66ce7/attachment.html>


More information about the scikit-learn mailing list