[scikit-learn] decision trees

Wed Mar 29 06:57:59 EDT 2017

There is https://github.com/scikit-learn/scikit-learn/pull/4899 .

It looks like it is waiting for review?

Raphael

On 29 March 2017 at 11:50, federico vaggi <vaggi.federico at gmail.com> wrote:
> That's a really good point.  Do you know of any systematic studies about the
> two different encodings?
>
> Finally: wasn't there a PR for RF to accept categorical variables as inputs?
>
> On Wed, 29 Mar 2017 at 11:57, Olivier Grisel <olivier.grisel at ensta.org>
> wrote:
>>
>> Integer coding will indeed make the DT assume an arbitrary ordering
>> while one-hot encoding does not force the tree model to make that
>> assumption.
>>
>> However in practice when the depth of the trees is not too limited (or
>> if you use a large enough ensemble of trees), the model will have
>> enough flexibility to introduce as many splits as necessary to isolate
>> individual categories in the integer coding, and therefore the
>> arbitrary ordering assumption is not a problem.
>>
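To make the point about isolating categories concrete, here is a minimal sketch in plain Python. It is not scikit-learn code, and the helper name `isolate_category` is hypothetical. It only illustrates that two nested threshold tests, of the kind a decision tree applies to one integer-coded feature, are enough to carve out a single category regardless of the arbitrary ordering.

```python
def isolate_category(codes, k):
    """Select the samples whose integer code equals k using threshold tests only.

    Mimics two nested decision-tree splits on one integer-coded feature:
    first `x <= k`, then `x <= k - 1` inside the left branch.
    """
    selected = []
    for x in codes:
        if x <= k:                # first split: keeps codes 0..k
            if not (x <= k - 1):  # second split: discards codes 0..k-1
                selected.append(x)
    return selected


# Hypothetical integer-coded column with categories 0..7.
codes = [3, 0, 7, 3, 5, 1, 3]
only_threes = isolate_category(codes, 3)
print(only_threes)
```

A tree that is allowed to grow deep enough can repeat this pattern for every category that matters, which is why the ordering implied by the integer codes need not hurt.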
>> On the other hand using one-hot encoding can introduce a detrimental
>> inductive bias on random forests: random forest uses uniform random
>> feature sampling when deciding which feature to split on (e.g. pick
>> the best split out of 25% of the features selected at random).
>>
>> Let's consider the following example: assume you have a
>> heterogeneously typed dataset with 99 numeric features and 1
>> categorical feature with cardinality 1000 (1000 possible values for
>> that feature):
>>
>> - the RF will have one chance in 100 to pick each feature (categorical
>> or numerical) as a candidate for the next split when using integer
>> coding,
>> - the RF will have roughly a 0.09% chance (1 in 1099) of picking each
>> numerical feature and roughly a 91% chance (1000 in 1099) of selecting
>> a candidate split on one category of the single categorical feature
>> when using one-hot encoding.
>>
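The arithmetic behind this example can be checked with a few lines of Python. This is a sketch under a simplifying assumption: it computes the chance that a single column drawn uniformly at random is of a given kind, whereas a real forest draws a subset of columns without replacement at each split. The variable names are illustrative only.

```python
from fractions import Fraction

n_numeric = 99       # numeric features in the example
n_categories = 1000  # cardinality of the single categorical feature

# Integer coding: the categorical variable is one column among 100.
n_cols_integer = n_numeric + 1
p_each_integer = Fraction(1, n_cols_integer)

# One-hot encoding: the categorical variable expands to 1000 columns.
n_cols_onehot = n_numeric + n_categories
p_each_numeric_onehot = Fraction(1, n_cols_onehot)
p_any_category_onehot = Fraction(n_categories, n_cols_onehot)

print("integer coding, any one feature:", float(p_each_integer))
print("one-hot, one numeric feature:   ", float(p_each_numeric_onehot))
print("one-hot, some category column:  ", float(p_any_category_onehot))
```

With 1099 columns in total, the one-hot columns of the categorical variable account for 1000/1099, about 91%, of a uniform draw, while each numeric feature drops from 1% to about 0.09%.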
>> The inductive bias of one-hot encoding on RFs can therefore completely
>> break the feature balancing. The feature encoding will also impact the
>> inductive bias with respect to the importance of the depth of the
>> trees, even when feature splits are selected fully deterministically.
>>
>> Finally, training on one-hot encoded features with large categorical
>> cardinalities will be much slower than training with naive integer
>> coding.
>>
>> TL;DR: naive theoretical analysis based only on the ordering
>> assumption can be misleading. Inductive biases of each feature
>> encoding are more complex to evaluate. Use cross-validation to decide
>> which is the best on your problem. Don't ignore computational
>> considerations (CPU and memory usage).
>>
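As a hedged illustration of the cross-validation advice, the sketch below compares the two encodings with a random forest on a small synthetic dataset. The data, the sizes, and the variable names are assumptions made up for the example. The scores say nothing about real problems and only show how such a comparison could be set up with scikit-learn.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
n_samples, n_numeric, n_categories = 300, 5, 20

# Synthetic data (an assumption for illustration): column 0 holds the
# integer-coded categorical variable, the other columns are numeric.
cat = rng.randint(0, n_categories, size=n_samples)
num = rng.normal(size=(n_samples, n_numeric))
X = np.column_stack([cat, num])
# The label depends on both the category and one numeric feature.
y = ((cat % 2 == 0) ^ (num[:, 0] > 0)).astype(int)

# Integer coding: feed the integer codes to the forest unchanged.
rf_integer = RandomForestClassifier(n_estimators=20, random_state=0)

# One-hot encoding: expand column 0, pass the numeric columns through.
one_hot = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)
rf_onehot = make_pipeline(
    one_hot, RandomForestClassifier(n_estimators=20, random_state=0)
)

scores = {
    "integer": cross_val_score(rf_integer, X, y, cv=3),
    "one-hot": cross_val_score(rf_onehot, X, y, cv=3),
}
for name, fold_scores in scores.items():
    print(name, fold_scores.mean())
```

Wrapping each call in a timer (for example `time.perf_counter`) would also expose the computational cost of each encoding, which matters as the cardinality grows.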
>> --
>> Olivier
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn