[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

Guillaume Lemaître g.lemaitre58 at gmail.com
Sat Sep 14 05:14:17 EDT 2019


I will just add that if you have heterogeneous types, you might want to
look at the ColumnTransformer:
https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

You might want to apply some scaling (not relevant for trees, though)
and encode the categories (ordinal encoding for tree-based models),
and then dispatch the data to a decision tree.
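
As a rough illustration of the two encodings discussed in this thread,
here is a minimal sketch using the "Car" column from the example further
down. The variable names are made up for illustration; OrdinalEncoder
and OneHotEncoder are the scikit-learn preprocessing classes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# The "Car" column from the example in this thread, as a single 2D feature.
cars = np.array([["BMW"], ["Toyota"], ["Audi"]])

# Ordinal encoding: one integer code per category. Categories are sorted,
# so the codes follow alphabetical order, not any meaningful ranking.
ordinal = OrdinalEncoder()
car_codes = ordinal.fit_transform(cars)

# One-hot encoding: one binary column per category.
one_hot = OneHotEncoder()
car_dummies = one_hot.fit_transform(cars).toarray()

print(ordinal.categories_)
print(car_codes.ravel())
print(car_dummies)
```

With the sorted categories (Audi, BMW, Toyota), the ordinal codes for
the three rows are 1, 2 and 0, while the one-hot version has three
columns with a single 1 per row.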

The example linked above shows how to construct such a preprocessor and
chain it with a predictor in a pipeline.
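
For the small table in the original question, such a pipeline could
look roughly like the sketch below. It is not taken from the linked
example: the toy DataFrame, the step name "categorical" and the
variable names are assumptions for illustration, and scaling is left
out since it does not matter for trees.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the question in this thread.
X = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Income": [10000, 9000, 12000],
    "Car": ["BMW", "Toyota", "Audi"],
})
y = ["Yes", "No", "Yes"]  # categorical target, hence a classifier

# Ordinal-encode the categorical columns; pass the numeric ones through.
preprocessor = ColumnTransformer(
    transformers=[("categorical", OrdinalEncoder(), ["Gender", "Car"])],
    remainder="passthrough",
)

# Chain the preprocessor and the tree so both are fitted together.
model = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Swapping OrdinalEncoder for OneHotEncoder in the transformer list would
give the one-hot variant discussed below; with a continuous target,
DecisionTreeRegressor would take the place of the classifier.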

On Sat, 14 Sep 2019 at 07:29, C W <tmrsg11 at gmail.com> wrote:

> Ahh, you are right. Regression vs. Classification is about the type of
> target variable, not features.
>
> Thanks, more clear now.
>
> Mike
>
> On Sat, Sep 14, 2019 at 1:23 AM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
>
>> Hi Mike,
>>
>> just to make sure we are on the same page,
>>
>> > I have mixed data types (continuous and categorical). Should I use
>> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?
>>
>> that's independent from the previous email. The comment
>>
>> > > "scikit-learn implementation does not support categorical variables
>> for now".
>>
>> we discussed via the previous email was referring to feature variables.
>> Whether you choose the DT regressor or classifier depends on the format of
>> your target variable.
>>
>> Best,
>> Sebastian
>>
>> > On Sep 13, 2019, at 11:41 PM, C W <tmrsg11 at gmail.com> wrote:
>> >
>> > Thanks, Sebastian. It's great to know that it works; I just need to do
>> one-hot encoding first.
>> >
>> > I have mixed data types (continuous and categorical). Should I use
>> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?
>> >
>> > I'm guessing tree.DecisionTreeClassifier()?
>> >
>> > Best,
>> >
>> > Mike
>> >
>> > On Fri, Sep 13, 2019 at 11:59 PM Sebastian Raschka <
>> mail at sebastianraschka.com> wrote:
>> > Hi,
>> >
>> > if you have the category "car" as shown in your example, this would
>> effectively be something like
>> >
>> > BMW=0
>> > Toyota=1
>> > Audi=2
>> >
>> > Sure, the algorithm will execute just fine on the feature column with
>> values in {0, 1, 2}. However, the problem is that it will come up with
>> binary rules like x_i >= 0.5 and x_i >= 1.5. I.e., it will treat
>> it as a continuous variable.
>> >
>> > What you can do is to encode this feature via one-hot encoding --
>> basically extend it into 2 (or 3) binary variables. This has its own
>> problems (if you have a feature with many possible values, you will end up
>> with a large number of binary variables, and they may dominate in the
>> resulting tree over other feature variables).
>> >
>> > In any case, I guess this is what
>> >
>> > > "scikit-learn implementation does not support categorical variables
>> for now".
>> >
>> >
>> > means ;).
>> >
>> > Best,
>> > Sebastian
>> >
>> > > On Sep 13, 2019, at 9:38 PM, C W <tmrsg11 at gmail.com> wrote:
>> > >
>> > > Hello all,
>> > > I'm very confused. Can the decision tree module handle both
>> continuous and categorical features in the dataset? In this case, it's just
>> CART (Classification and Regression Trees).
>> > >
>> > > For example,
>> > > Gender  Age  Income  Car     Attendance
>> > > Male    30   10000   BMW     Yes
>> > > Female  35    9000   Toyota  No
>> > > Male    50   12000   Audi    Yes
>> > >
>> > > According to the documentation
>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>> it cannot!
>> > >
>> > > It says: "scikit-learn implementation does not support categorical
>> variables for now".
>> > >
>> > > Is this true? If not, can someone point me to an example? If yes,
>> what do people do?
>> > >
>> > > Thank you very much!
>> > >
>> > >
>> > >
>> > > _______________________________________________
>> > > scikit-learn mailing list
>> > > scikit-learn at python.org
>> > > https://mail.python.org/mailman/listinfo/scikit-learn


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/