[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

C W tmrsg11 at gmail.com
Sat Sep 14 01:26:58 EDT 2019


Ahh, you are right. Regression vs. Classification is about the type of
target variable, not features.

Thanks, it's clearer now.

Mike

On Sat, Sep 14, 2019 at 1:23 AM Sebastian Raschka <mail at sebastianraschka.com>
wrote:

> Hi Mike,
>
> just to make sure we are on the same page,
>
> > I have mixed data types (continuous and categorical). Should I use
> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?
>
> that's independent from the previous email. The comment
>
> > > "scikit-learn implementation does not support categorical variables
> for now".
>
> we discussed in the previous email was referring to feature variables.
> Whether you choose the DT regressor or classifier depends on the type of
> your target variable: a classifier for a categorical target, a regressor
> for a continuous one.
>
> Best,
> Sebastian
>
> > On Sep 13, 2019, at 11:41 PM, C W <tmrsg11 at gmail.com> wrote:
> >
> > Thanks, Sebastian. It's great to know that it works, just need to do
> one-hot-encoding first.
> >
> > I have mixed data types (continuous and categorical). Should I use
> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?
> >
> > I'm guessing tree.DecisionTreeClassifier()?
> >
> > Best,
> >
> > Mike
> >
> > On Fri, Sep 13, 2019 at 11:59 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > Hi,
> >
> > if you have the category "car" as shown in your example, this would
> effectively be something like
> >
> > BMW=0
> > Toyota=1
> > Audi=2
> >
> > Sure, the algorithm will execute just fine on the feature column with
> values in {0, 1, 2}. However, the problem is that it will come up with
> binary threshold rules like x_i <= 0.5 and x_i <= 1.5. I.e., it will treat
> it as an ordered, continuous variable, even though the car brands have no
> natural order.
> >
> > What you can do is encode this feature via one-hot encoding --
> basically expand it into 2 (or 3) binary variables. This has its own
> problems (if you have a feature with many possible values, you will end up
> with a large number of binary variables, and they may dominate in the
> resulting tree over other feature variables).
> >
> > In any case, I guess this is what
> >
> > > "scikit-learn implementation does not support categorical variables
> for now".
> >
> >
> > means ;).
> >
> > Best,
> > Sebastian
> >
> > > On Sep 13, 2019, at 9:38 PM, C W <tmrsg11 at gmail.com> wrote:
> > >
> > > Hello all,
> > > I'm very confused. Can the decision tree module handle both continuous
> and categorical features in the dataset? In this case, it's just CART
> (Classification and Regression Trees).
> > >
> > > For example,
> > > Gender  Age  Income  Car     Attendance
> > > Male    30   10000   BMW     Yes
> > > Female  35    9000   Toyota  No
> > > Male    50   12000   Audi    Yes
> > >
> > > According to the documentation
> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
> it cannot!
> > >
> > > It says: "scikit-learn implementation does not support categorical
> variables for now".
> > >
> > > Is this true? If not, can someone point me to an example? If yes, what
> do people do?
> > >
> > > Thank you very much!
> > >
> > >
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
>
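
A minimal sketch of the one-hot-encoding approach discussed in this thread, using the three example rows from the original question. The choice of ColumnTransformer, OneHotEncoder and make_pipeline is an assumption about how one might wire this up, not something prescribed in the thread, and three rows are far too few for a meaningful model; the point is only to show the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# The three example rows from the original question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Income": [10000, 9000, 12000],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
})

X = df.drop(columns="Attendance")
y = df["Attendance"]  # categorical target, so a classifier is used

# One-hot encode the categorical feature columns and pass the
# continuous ones (Age, Income) through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Car"]),
    ],
    remainder="passthrough",
)

# Chain the encoding and the tree so both are fitted together.
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# Gender expands to 2 binary columns and Car to 3; with Age and Income
# that gives 7 feature columns in total.
n_columns = model.named_steps["columntransformer"].transform(X).shape[1]
print(n_columns)
print(model.predict(X))
```

If the target were continuous instead (for example, predicting Income), the same preprocessing could be paired with DecisionTreeRegressor; the feature encoding does not change.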