[scikit-learn] decision trees
Andrew Howe
ahowe42 at gmail.com
Wed Mar 29 06:46:46 EDT 2017
Thanks very much for the thorough answer. I hadn't thought about the
inductive bias issue with my forests. I'll evaluate both encodings for
my unordered categoricals.
Andrew
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
On Wed, Mar 29, 2017 at 12:56 PM, Olivier Grisel <olivier.grisel at ensta.org>
wrote:
> Integer coding will indeed make the DT assume an arbitrary ordering
> of the categories, while one-hot encoding does not force the tree
> model to make that assumption.
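>
> For example, a minimal sketch contrasting the two encodings with a
> recent scikit-learn (the string-capable OneHotEncoder and the
> OrdinalEncoder postdate this thread; the data here is illustrative):
>
>     import numpy as np
>     from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
>
>     colors = np.array([["red"], ["green"], ["blue"], ["green"]])
>
>     # Integer coding: a single column whose values imply an arbitrary
>     # ordering (blue=0, green=1, red=2, i.e. alphabetical).
>     ordinal = OrdinalEncoder().fit_transform(colors)  # shape (4, 1)
>
>     # One-hot encoding: one binary column per category, no ordering.
>     onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)  # shape (4, 3)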
>
> However, in practice, when the depth of the trees is not too limited
> (or if you use a large enough ensemble of trees), the model has
> enough flexibility to introduce as many splits as necessary to
> isolate individual categories of the integer-coded feature, and the
> arbitrary ordering assumption is therefore not a problem.
>
> On the other hand, one-hot encoding can introduce a detrimental
> inductive bias for random forests: a random forest uses uniform
> random feature sampling when deciding which features to consider for
> a split (e.g. pick the best split out of 25% of the features selected
> at random).
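>
> In scikit-learn this per-split feature sampling is controlled by the
> max_features parameter; a minimal sketch matching the 25% figure
> above (100 trees is an arbitrary illustrative choice):
>
>     from sklearn.ensemble import RandomForestClassifier
>
>     # Consider a random 25% of the features when searching for each split.
>     rf = RandomForestClassifier(n_estimators=100, max_features=0.25)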
>
> Let's consider the following example: assume you have a
> heterogeneously typed dataset with 99 numeric features and 1
> categorical feature of cardinality 1000 (1000 possible values for
> that feature):
>
> - when using integer coding, the RF has a 1-in-100 chance of picking
> each feature (categorical or numerical) as a candidate for the next
> split;
> - when using one-hot encoding, the RF has roughly a 0.1% chance of
> picking each numerical feature and roughly a 91% chance of selecting
> a candidate split on one of the categories of the unique categorical
> feature (the arithmetic is spelled out just below).
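>
> The arithmetic behind those two bullet points, as a quick check:
>
>     n_numeric = 99
>     n_categories = 1000
>
>     # Integer coding: 99 numeric + 1 integer-coded column = 100 features.
>     p_integer = 1 / (n_numeric + 1)                         # 0.01  -> 1 in 100
>
>     # One-hot: 99 numeric + 1000 binary columns = 1099 features.
>     p_numeric = 1 / (n_numeric + n_categories)              # ~0.0009 -> ~0.1%
>     p_category = n_categories / (n_numeric + n_categories)  # ~0.91   -> ~91%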
>
> The inductive bias of one-hot encoding on RFs can therefore
> completely break the feature balancing. The feature encoding will
> also affect the inductive bias with respect to the importance of tree
> depth, even when feature splits are selected fully deterministically.
>
> Finally, one-hot encoding features with large categorical
> cardinalities will be much slower than naive integer coding.
>
> TL;DR: a naive theoretical analysis based only on the ordering
> assumption can be misleading. The inductive biases of each feature
> encoding are more complex to evaluate. Use cross-validation to decide
> which works best on your problem, and don't ignore computational
> considerations (CPU and memory usage).
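>
> A minimal cross-validation sketch along those lines (X, y and
> categorical_cols are placeholders for your own data; ColumnTransformer
> comes from a newer scikit-learn than this thread):
>
>     from sklearn.compose import ColumnTransformer
>     from sklearn.ensemble import RandomForestClassifier
>     from sklearn.model_selection import cross_val_score
>     from sklearn.pipeline import make_pipeline
>     from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
>
>     # Compare the two encodings on the same model and data.
>     for encoder in (OrdinalEncoder(), OneHotEncoder()):
>         pre = ColumnTransformer([("cat", encoder, categorical_cols)],
>                                 remainder="passthrough")
>         model = make_pipeline(pre, RandomForestClassifier(n_estimators=100))
>         scores = cross_val_score(model, X, y, cv=5)
>         print(type(encoder).__name__, scores.mean())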
>
> --
> Olivier
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>