[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

C W tmrsg11 at gmail.com
Fri Oct 4 12:48:24 EDT 2019


I'm getting some funny results. I am fitting a regression decision tree; the
response variables are assigned to levels.

The funny part is that the tree is treating my one-hot encoding (BMW=0,
Toyota=1, Audi=2) as numerical values, not categories.

The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How does
sklearn know internally that 0 vs. 1 is categorical, not numerical?
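
For reference, a minimal sketch that reproduces the behaviour described above. The toy incomes are made up for illustration, and the Car column is stored as a single column of integer codes (BMW=0, Toyota=1, Audi=2). With that layout the tree has no way to tell the codes from ordinary numbers, so it thresholds halfway between neighbouring codes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: Car stored as one integer-coded column
# (BMW=0, Toyota=1, Audi=2). The income values are illustrative only.
car_codes = np.array([[0], [1], [2], [0], [1], [2]])
income = np.array([10000.0, 9000.0, 12000.0, 10200.0, 9100.0, 11800.0])

tree = DecisionTreeRegressor(random_state=0).fit(car_codes, income)

# Internal (split) nodes have a feature index >= 0; leaves are marked -2.
is_split = tree.tree_.feature >= 0
thresholds = sorted(float(t) for t in tree.tree_.threshold[is_split])
print(thresholds)
```

The thresholds land at the midpoints between the codes, which is the same treatment any numerical column would get.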

In R for instance, you do as.factor(), which explicitly states the data
type.
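
One possible way to get closer to that in scikit-learn, sketched here under some assumptions: the documentation quoted later in this thread says the tree implementation does not support categorical variables, so the usual workaround is to expand each categorical column into 0/1 indicator columns before the tree sees it. The DataFrame below is a toy modelled on the table in this thread (the fourth row and all values are made up), and the step name "onehot" is just a label chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data modelled on the table in this thread; values are illustrative.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Car": ["BMW", "Toyota", "Audi", "BMW"],
    "Attendance": ["Yes", "No", "Yes", "No"],
    "Income": [10000, 9000, 12000, 9500],
})
categorical = ["Gender", "Car", "Attendance"]

# Each categorical column becomes one 0/1 indicator column per category.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]
)
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(df[categorical], df["Income"])

# Count the indicator columns the encoder produced.
encoder = model.named_steps["columntransformer"].named_transformers_["onehot"]
n_columns = sum(len(cats) for cats in encoder.categories_)
predictions = model.predict(df[categorical])
print(n_columns, predictions.shape)
```

With this layout every split is on a single indicator (e.g. "is the car a BMW or not"), so no ordering between BMW, Toyota and Audi is implied.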

Thank you!


On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com> wrote:

>
>
> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>
>
>
> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>
>> Thanks, Guillaume.
>> Column transformer looks pretty neat. I've also heard, though, that this
>> pipeline can be tedious to set up? Specifying what you want for every
>> feature is a pain.
>>
>
> It would be interesting for us to know which part of the pipeline is tedious
> to set up, so we can see whether we can improve something there.
> Do you mean that you would like to automatically detect the type of each
> feature (categorical/numerical) and apply a
> default encoder/scaler, as discussed here:
> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>
> IMO, from a user perspective, it would be cleaner in some cases, at the cost
> of blindly applying a black box,
> which might be dangerous.
>
> Also see
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> which basically does that.
>
>
>
>
>>
>> Javier,
>> Actually, you guessed right. My real data has only one numerical
>> variable, looks more like this:
>>
>> Gender  Date       Income  Car     Attendance
>> Male    2019/3/01  10000   BMW     Yes
>> Female  2019/5/02  9000    Toyota  No
>> Male    2019/7/15  12000   Audi    Yes
>>
>> I am predicting income using all other categorical variables. Maybe it is
>> catboost!
>>
>> Thanks,
>>
>> M
>>
>>
>>
>>
>>
>>
>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>>
>>> If you have datasets with many categorical features, and perhaps many
>>> categories, the tools in sklearn are quite limited,
>>> but there are alternative implementations of boosted trees that are
>>> designed with categorical features in mind. Take a look
>>> at catboost [1], which has an sklearn-compatible API.
>>>
>>> J
>>>
>>> [1] https://catboost.ai/
>>>
>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>>>
>>>> Hello all,
>>>> I'm very confused. Can the decision tree module handle both continuous
>>>> and categorical features in the dataset? In this case, it's just CART
>>>> (Classification and Regression Trees).
>>>>
>>>> For example,
>>>> Gender  Age  Income  Car     Attendance
>>>> Male    30   10000   BMW     Yes
>>>> Female  35   9000    Toyota  No
>>>> Male    50   12000   Audi    Yes
>>>>
>>>> According to the documentation
>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>> it cannot!
>>>>
>>>> It says: "scikit-learn implementation does not support categorical
>>>> variables for now".
>>>>
>>>> Is this true? If not, can someone point me to an example? If yes, what
>>>> do people do?
>>>>
>>>> Thank you very much!
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
>
>
>