[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

Guillaume Lemaître g.lemaitre58 at gmail.com
Sun Sep 15 08:16:29 EDT 2019


On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:

> Thanks, Guillaume.
> Column transformer looks pretty neat. I've also heard though, this
> pipeline can be tedious to set up? Specifying what you want for every
> feature is a pain.
>

It would be useful for us to know which part of the pipeline is tedious to
set up, so that we can see whether something can be improved there.
Do you mean that you would like the type of each feature
(categorical/numerical) to be detected automatically and a default
encoder/scaler applied, as discussed here:
https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127

IMO, from a user perspective, it would be cleaner in some cases, at the cost
of blindly applying a black box,
which might be dangerous.
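
To make the question concrete, here is a minimal sketch of what such
automatic detection could look like when done by hand today. It assumes the
data is a pandas DataFrame shaped like the example table from the original
question, that every non-numeric column should be one-hot encoded, and that
numeric columns can be passed through unchanged (trees do not need scaling).
The column names and values are only illustrative, and this is one possible
setup rather than a recommended default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data modelled on the table from the original question (illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})
X = df.drop(columns="Income")
y = df["Income"]

# Infer the feature types from the pandas dtypes instead of listing them by hand.
# Assumption: every non-numeric column is categorical.
numerical_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    # One-hot encode the categorical columns; unseen categories are ignored at predict time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Leave the numerical columns untouched.
    ("num", "passthrough", numerical_cols),
])

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
```

The dtype-based split is exactly the "black box" part: an integer-coded
category would silently be treated as numerical, so the inferred lists are
worth checking before fitting.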


>
> Javier,
> Actually, you guessed right. My real data has only one numerical
> variable, looks more like this:
>
> Gender  Date       Income  Car     Attendance
> Male    2019/3/01  10000   BMW     Yes
> Female  2019/5/02  9000    Toyota  No
> Male    2019/7/15  12000   Audi    Yes
>
> I am predicting income using all other categorical variables. Maybe it is
> catboost!
>
> Thanks,
>
> M
>
> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>
>> If you have datasets with many categorical features, and perhaps many
>> categories, the tools in sklearn are quite limited,
>> but there are alternative implementations of boosted trees that are
>> designed with categorical features in mind. Take a look
>> at catboost [1], which has an sklearn-compatible API.
>>
>> J
>>
>> [1] https://catboost.ai/
>>
>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>>
>>> Hello all,
>>> I'm very confused. Can the decision tree module handle both continuous
>>> and categorical features in the dataset? In this case, it's just CART
>>> (Classification and Regression Trees).
>>>
>>> For example,
>>> Gender  Age  Income  Car     Attendance
>>> Male    30   10000   BMW     Yes
>>> Female  35   9000    Toyota  No
>>> Male    50   12000   Audi    Yes
>>>
>>> According to the documentation
>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>> it cannot!
>>>
>>> It says: "scikit-learn implementation does not support categorical
>>> variables for now".
>>>
>>> Is this true? If not, can someone point me to an example? If yes, what
>>> do people do?
>>>
>>> Thank you very much!
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/