[scikit-learn] decision trees

Julio Antonio Soto de Vicente julio at esbet.es
Wed Mar 29 13:04:40 EDT 2017


IMO CART can handle categorical features just as well as CITrees, as long as we slightly change sklearn's implementation...

--
Julio

> On 29 Mar 2017, at 15:30, Andreas Mueller <t3kcit at gmail.com> wrote:
> 
> I'd argue that's why we should implement conditional inference trees ;)
> 
>> On 03/29/2017 05:56 AM, Olivier Grisel wrote:
>> Integer coding will indeed make the DT assume an arbitrary ordering
>> while one-hot encoding does not force the tree model to make that
>> assumption.
>> 
>> However in practice when the depth of the trees is not too limited (or
>> if you use a large enough ensemble of trees), the model will have
>> enough flexibility to introduce as many splits as necessary to isolate
>> individual categories in the integer and therefore the arbitrary
>> ordering assumption is not a problem.
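
A quick sketch of the point above (synthetic data and an unrestricted `DecisionTreeClassifier` are my assumptions): when the tree depth is not limited, the tree can introduce as many splits as it needs to isolate every integer-coded category, so the arbitrary ordering does not hurt it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_categories = 10

# Integer-coded categorical feature with an arbitrary category order.
X = rng.randint(0, n_categories, size=(500, 1))

# Target depends on the category in a way unrelated to the integer order.
target_per_category = rng.randint(0, 2, size=n_categories)
y = target_per_category[X[:, 0]]

# Unrestricted depth: the tree can carve out each category one by one.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
print(tree.score(X, y))  # perfect training accuracy: 1.0
```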
>> 
>> On the other hand using one-hot encoding can introduce a detrimental
>> inductive bias on random forests: random forest uses uniform random
>> feature sampling when deciding which feature to split on (e.g. pick
>> the best split out of 25% of the features selected at random).
>> 
>> Let's consider the following example: assume you have a
>> heterogeneously typed dataset with 99 numeric features and 1
>> categorical feature of cardinality 1000 (1000 possible
>> values for that feature):
>> 
>> - with integer coding the RF has one chance in 100 to pick each
>> feature (categorical or numerical) as a candidate for the next split;
>> - with one-hot encoding there are 99 + 1000 = 1099 columns, so each
>> numerical feature is picked with probability 1/1099 (below 0.1%),
>> while the 1000 binary columns of the single categorical feature
>> together absorb roughly 91% of the candidate splits.
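
The arithmetic in the two bullets above can be checked in a few lines (the numbers are taken directly from the example: 99 numeric features, one categorical feature with 1000 categories, uniform candidate-feature sampling):

```python
n_numeric, n_categories = 99, 1000

# Integer coding: 99 numeric + 1 categorical = 100 columns in total.
p_integer = 1 / (n_numeric + 1)
print(p_integer)  # 0.01 -> each feature has a 1% chance per candidate draw

# One-hot encoding: 99 numeric + 1000 binary indicator columns = 1099.
n_columns = n_numeric + n_categories
p_numeric = 1 / n_columns                  # ~0.0009, i.e. below 0.1%
p_categorical = n_categories / n_columns   # ~0.91, i.e. roughly 91%
print(p_numeric, p_categorical)
```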
>> 
>> The inductive bias of one-hot encoding on RFs can therefore completely
>> break the feature balancing. The feature encoding will also impact the
>> inductive bias with respect to the importance of the depth of the
>> trees, even when feature splits are selected fully deterministically.
>> 
>> Finally, one-hot encoding features with large categorical cardinalities
>> will be much slower than naive integer coding.
>> 
>> TL;DR: a naive theoretical analysis based only on the ordering
>> assumption can be misleading. The inductive biases of each feature
>> encoding are more complex to evaluate. Use cross-validation to decide
>> which works best on your problem, and don't ignore computational
>> considerations (CPU and memory usage).
>> 
> 
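
Following the cross-validation advice at the end of Olivier's message, here is a minimal sketch of how one might compare the two encodings on the same random forest (the synthetic dataset and the choice of encoders are illustrative assumptions, not a recommendation):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)
# Two numeric features plus one categorical feature (8 categories).
X = np.column_stack([
    rng.randn(300, 2),
    rng.randint(0, 8, size=(300, 1)),
])
# Target driven by the parity of the category.
y = (X[:, 2].astype(int) % 2 == 0).astype(int)

for encoder in (OrdinalEncoder(), OneHotEncoder(handle_unknown="ignore")):
    model = make_pipeline(
        ColumnTransformer([("cat", encoder, [2])], remainder="passthrough"),
        RandomForestClassifier(random_state=0),
    )
    scores = cross_val_score(model, X, y, cv=5)
    print(type(encoder).__name__, scores.mean())
```

Comparing the mean cross-validated scores (and the fit times) of the two pipelines on the actual dataset is exactly the empirical check the TL;DR recommends over a purely theoretical argument.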
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
