[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

C W tmrsg11 at gmail.com
Fri Oct 4 18:34:50 EDT 2019


I don't understand your answer.

Why, after one-hot encoding, does it still output splits of greater than or
less than 0.5? Does the sklearn website have a working example with categorical input?

Thanks!

On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <mail at sebastianraschka.com>
wrote:

> Like Nicolas said, the 0.5 is just a workaround, but it will do the right
> thing on the one-hot encoded variables here. You will find that the
> threshold is always at 0.5 for these variables. I.e., what it will do is
> use the following conversion:
>
> treat as car_Audi=1 if car_Audi >= 0.5
> treat as car_Audi=0 if car_Audi < 0.5
>
> or, it may be
>
> treat as car_Audi=1 if car_Audi > 0.5
> treat as car_Audi=0 if car_Audi <= 0.5
>
> (I forget which one sklearn uses, but either way it will be fine.)
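>
> For illustration, here is a minimal sketch (toy data; the column names
> car_Audi and car_BMW are made up) showing where the threshold ends up:
>
> import numpy as np
> from sklearn.tree import DecisionTreeClassifier
>
> # columns: car_Audi, car_BMW (already one-hot encoded)
> X = np.array([[1, 0],
>               [0, 1],
>               [1, 0],
>               [0, 1]])
> y = [1, 0, 1, 0]
> clf = DecisionTreeClassifier(random_state=0).fit(X, y)
> # per-node split thresholds; leaf nodes are marked with -2
> print(clf.tree_.threshold)  # e.g., [ 0.5 -2.  -2. ]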
>
> Best,
> Sebastian
>
>
> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <niourf at gmail.com> wrote:
>
>
> But the decision tree is still mistaking the one-hot encoding for numerical
> input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>
>
> You're not doing anything wrong, and neither is the tree. Trees don't
> support categorical variables in sklearn, so everything is treated as
> numerical.
>
> This is why we do one-hot encoding: so that a set of numerical (one-hot
> encoded) features can be treated as if they were just one categorical
> feature.
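>
> As a minimal sketch (with a hypothetical "car" column):
>
> import pandas as pd
>
> df = pd.DataFrame({"car": ["Audi", "BMW", "Toyota", "Audi"]})
> # one new column per category: car_Audi, car_BMW, car_Toyota
> onehot = pd.get_dummies(df, columns=["car"])
> print(onehot.columns.tolist())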
>
>
> Nicolas
> On 10/4/19 2:01 PM, C W wrote:
>
> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo
> on my part.
>
> Looks like I did the one-hot encoding correctly. My new variable names are:
> car_Audi, car_BMW, etc.
>
> But the decision tree is still mistaking the one-hot encoding for numerical
> input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>
> Is there a good toy example on the sklearn website? I only see this:
> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>
> Thanks!
>
>
>
> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>
>> Hi,
>>
>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1,
>> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5.
>>
>>
>> that's not a one-hot encoding then.
>>
>> For an Audi datapoint, it should be
>>
>> BMW=0
>> Toyota=0
>> Audi=1
>>
>> for BMW
>>
>> BMW=1
>> Toyota=0
>> Audi=0
>>
>> and for Toyota
>>
>> BMW=0
>> Toyota=1
>> Audi=0
>>
>> The split threshold should then be at 0.5 for any of these features.
>>
>> Based on your email, I think you were assuming that the DT does the
>> one-hot encoding internally, which it doesn't. In practice, it is hard to
>> guess what is a nominal and what is an ordinal variable, so you have to do
>> the one-hot encoding before you give the data to the decision tree.
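>>
>> A minimal sketch of that workflow (toy data; note that in recent sklearn
>> versions the sparse argument of OneHotEncoder is called sparse_output):
>>
>> from sklearn.preprocessing import OneHotEncoder
>> from sklearn.tree import DecisionTreeRegressor
>>
>> cars = [["BMW"], ["Toyota"], ["Audi"]]
>> incomes = [10000, 9000, 12000]
>>
>> # encode first, then hand the purely numeric matrix to the tree
>> enc = OneHotEncoder(sparse=False)
>> X = enc.fit_transform(cars)
>> tree = DecisionTreeRegressor().fit(X, incomes)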
>>
>> Best,
>> Sebastian
>>
>> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
>>
>> I'm getting some funny results. I am doing a regression decision tree;
>> the response variables are assigned to levels.
>>
>> The funny part is: the tree is taking the one-hot encoding (BMW=0, Toyota=1,
>> Audi=2) as numerical values, not categories.
>>
>> The tree splits at 0.5 and 1.5. Am I doing the one-hot encoding wrong? How
>> does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>
>> In R for instance, you do as.factor(), which explicitly states the data
>> type.
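>>
>> The closest thing I found in pandas (though if I understand correctly,
>> sklearn's trees ignore this dtype anyway) is:
>>
>> import pandas as pd
>>
>> df = pd.DataFrame({"car": ["BMW", "Toyota", "Audi"]})
>> # analogous to R's as.factor(), but the tree does not make use of it
>> df["car"] = df["car"].astype("category")
>> print(df["car"].dtype)  # category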
>>
>> Thank you!
>>
>>
>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>
>>>
>>>
>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>>>
>>>> Thanks, Guillaume.
>>>> Column transformer looks pretty neat. I've also heard, though, that this
>>>> pipeline can be tedious to set up; specifying what you want for every
>>>> feature is a pain.
>>>>
>>>
>>> It would be interesting for us to know which part of the pipeline is
>>> tedious to set up, so that we can see if we can improve something there.
>>> Do you mean that you would like to automatically detect the type of each
>>> feature (categorical/numerical) and apply a
>>> default encoder/scaler, as discussed here:
>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>
>>> IMO, from a user perspective, it would be cleaner in some cases, at the
>>> cost of blindly applying a black box,
>>> which might be dangerous.
>>>
>>> Also see
>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>> which basically does that.
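>>>
>>> For reference, the manual setup is only a few lines. A minimal sketch,
>>> using hypothetical column names from the example data below:
>>>
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
>>> from sklearn.tree import DecisionTreeRegressor
>>>
>>> categorical = ["Gender", "Car"]
>>> numerical = ["Age"]
>>>
>>> # one-hot encode the categorical columns, scale the numerical ones,
>>> # then feed everything to the tree
>>> preprocess = ColumnTransformer(
>>>     [("onehot", OneHotEncoder(), categorical),
>>>      ("scale", StandardScaler(), numerical)])
>>> model = make_pipeline(preprocess, DecisionTreeRegressor())
>>> # model.fit(X_train, y_train) with X_train a DataFrame holding those columns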
>>>
>>>
>>>
>>>
>>>>
>>>> Javier,
>>>> Actually, you guessed right. My real data has only one numerical
>>>> variable; it looks more like this:
>>>>
>>>> Gender  Date       Income  Car     Attendance
>>>> Male    2019/3/01   10000  BMW     Yes
>>>> Female  2019/5/02    9000  Toyota  No
>>>> Male    2019/7/15   12000  Audi    Yes
>>>>
>>>> I am predicting income using all the other (categorical) variables. Maybe
>>>> catboost is what I need!
>>>>
>>>> Thanks,
>>>>
>>>> M
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>>>>
>>>>> If you have datasets with many categorical features, and perhaps many
>>>>> categories, the tools in sklearn are quite limited,
>>>>> but there are alternative implementations of boosted trees that are
>>>>> designed with categorical features in mind. Take a look
>>>>> at catboost [1], which has an sklearn-compatible API.
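>>>>>
>>>>> A minimal sketch (untested; assumes a DataFrame X with hypothetical
>>>>> columns "Gender" and "Car", and a target y):
>>>>>
>>>>> from catboost import CatBoostRegressor
>>>>>
>>>>> # cat_features marks the categorical columns, so no manual
>>>>> # one-hot encoding is needed
>>>>> model = CatBoostRegressor(cat_features=["Gender", "Car"], verbose=False)
>>>>> model.fit(X, y)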
>>>>>
>>>>> J
>>>>>
>>>>> [1] https://catboost.ai/
>>>>>
>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> I'm very confused. Can the decision tree module handle both
>>>>>> continuous and categorical features in the dataset? In this case, it's just
>>>>>> CART (Classification and Regression Trees).
>>>>>>
>>>>>> For example,
>>>>>> Gender  Age  Income  Car     Attendance
>>>>>> Male     30   10000  BMW     Yes
>>>>>> Female   35    9000  Toyota  No
>>>>>> Male     50   12000  Audi    Yes
>>>>>>
>>>>>> According to the documentation at
>>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>>> it cannot!
>>>>>>
>>>>>> It says: "scikit-learn implementation does not support categorical
>>>>>> variables for now".
>>>>>>
>>>>>> Is this true? If not, can someone point me to an example? If yes,
>>>>>> what do people do?
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>> --
>>> Guillaume Lemaitre
>>> INRIA Saclay - Parietal team
>>> Center for Data Science Paris-Saclay
>>> https://glemaitre.github.io/
>>>