[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
Sebastian Raschka
mail at sebastianraschka.com
Sun Oct 6 10:40:09 EDT 2019
Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look:
https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb
Happy to add this to the sklearn issue list if there's no issue filed for that yet.
Best,
Sebastian
> On Oct 6, 2019, at 9:10 AM, Andreas Mueller <t3kcit at gmail.com> wrote:
>
>
>
> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>> The docs show a way such that you don't need to write it as png file using tree.plot_tree:
>> https://scikit-learn.org/stable/modules/tree.html#classification
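For reference, a minimal sketch of the matplotlib-based route mentioned above. It assumes scikit-learn >= 0.21 (where tree.plot_tree is available) and matplotlib are installed. The iris data, max_depth=2, and the figure size are illustrative choices only, not something taken from this thread.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit a small tree on a built-in dataset (illustrative choice).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Draw the tree directly onto a matplotlib Axes; no intermediate png
# or GraphViz installation is needed for this route.
fig, ax = plt.subplots(figsize=(8, 5))
annotations = plot_tree(
    clf,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
```

In a notebook the figure is shown inline; calling fig.savefig("tree.png") would write it to disk if a file is wanted.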
>>
>> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but don't remember), which is why I still stick to graphviz.
> Can you give me examples that don't look as nice? I would love to improve it.
>
>> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do
>>
>> conda install pydotplus
>> conda install graphviz
>>
>> Coincidentally, I just made an example for a lecture I was teaching on Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>>
>> Best,
>> Sebastian
>>
>>
>>> On Oct 4, 2019, at 10:09 PM, C W <tmrsg11 at gmail.com> wrote:
>>>
>>> On a separate note, what do you use for plotting?
>>>
>>> I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?
>>>
>>> Thanks!
>>>
>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird)-- something like that wouldn't be mentioned in textbooks.
>>>
>>> Best,
>>> Sebastian
>>>
>>>> On Oct 4, 2019, at 6:33 PM, C W <tmrsg11 at gmail.com> wrote:
>>>>
>>>> Thanks Sebastian, I think I get it.
>>>>
>>>> I've just never seen it this way. It's quite different from what I'm used to in Elements of Statistical Learning.
>>>>
>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>>>> Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like
>>>>
>>>> if x == 1 then right child node
>>>> else left child node
>>>>
>>>> Instead, what it does is
>>>>
>>>> if x >= 0.5 then right child node
>>>> else left child node
>>>>
>>>> These are basically equivalent as you can see when you just plug in values 0 and 1 for x.
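To make the "plug in 0 and 1" check concrete, here is a small pure-Python sketch. The function names are hypothetical and exist only for illustration. The third rule mirrors the convention documented for scikit-learn's fitted trees, where samples with a feature value <= threshold go to the left child.

```python
def categorical_rule(x):
    # The rule one would write for a true categorical (binary) feature.
    return "right" if x == 1 else "left"


def threshold_rule(x):
    # The numerical stand-in described above.
    return "right" if x >= 0.5 else "left"


def sklearn_style_rule(x, threshold=0.5):
    # Convention of scikit-learn's fitted trees: x <= threshold goes left.
    return "left" if x <= threshold else "right"


# For a one-hot column the only possible values are 0 and 1,
# so it is enough to compare the three rules on those two inputs.
results = {
    x: (categorical_rule(x), threshold_rule(x), sklearn_style_rule(x))
    for x in (0, 1)
}
all_agree = all(len(set(routes)) == 1 for routes in results.values())
print(results, all_agree)
```

Because no value of a one-hot column ever lies strictly between 0 and 1, it does not matter whether the comparison at 0.5 is strict or not.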
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Oct 4, 2019, at 5:34 PM, C W <tmrsg11 at gmail.com> wrote:
>>>>>
>>>>> I don't understand your answer.
>>>>>
>>>>> Why after one-hot-encoding it still outputs greater than 0.5 or less than? Does sklearn website have a working example on categorical input?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>>>>> Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion:
>>>>>
>>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>>
>>>>> or, it may be
>>>>>
>>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>>
>>>>> (Forgot which one sklearn is using, but either way, it will be fine.)
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <niourf at gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>>>>>
>>>>>> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nicolas
>>>>>>
>>>>>> On 10/4/19 2:01 PM, C W wrote:
>>>>>>> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on my part.
>>>>>>>
>>>>>>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>>>>>
>>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>>>>>>>
>>>>>>> Is there a good toy example on the sklearn website? I only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5
>>>>>>> that's not a onehot encoding then.
>>>>>>>
>>>>>>> For an Audi datapoint, it should be
>>>>>>>
>>>>>>> BMW=0
>>>>>>> Toyota=0
>>>>>>> Audi=1
>>>>>>>
>>>>>>> for BMW
>>>>>>>
>>>>>>> BMW=1
>>>>>>> Toyota=0
>>>>>>> Audi=0
>>>>>>>
>>>>>>> and for Toyota
>>>>>>>
>>>>>>> BMW=0
>>>>>>> Toyota=1
>>>>>>> Audi=0
>>>>>>>
>>>>>>> The split threshold should then be at 0.5 for any of these features.
>>>>>>>
>>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
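A sketch of the difference between the two encodings discussed above, assuming pandas and scikit-learn are installed. The incomes are made-up toy values, and split_thresholds is a hypothetical helper written for this example, not part of scikit-learn.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data loosely modeled on the table in this thread (values are made up).
df = pd.DataFrame({
    "Car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "Income": [10000, 9000, 12000, 10500, 9200, 11800],
})


def split_thresholds(fitted_tree):
    """Return the sorted thresholds of all internal (non-leaf) nodes."""
    t = fitted_tree.tree_
    internal = t.children_left != -1  # leaves are marked with -1
    return sorted(t.threshold[internal].tolist())


# (a) Integer codes: the tree sees a single ordered numeric column,
#     so it separates the three cars with cuts between 0, 1, and 2.
codes = df["Car"].map({"BMW": 0, "Toyota": 1, "Audi": 2}).to_frame()
tree_codes = DecisionTreeRegressor(random_state=0).fit(codes, df["Income"])
code_thresholds = split_thresholds(tree_codes)

# (b) One-hot encoding done before fitting: one 0/1 column per car,
#     so every split lands halfway between 0 and 1.
onehot = pd.get_dummies(df[["Car"]], dtype=float)
tree_onehot = DecisionTreeRegressor(random_state=0).fit(onehot, df["Income"])
onehot_thresholds = split_thresholds(tree_onehot)

print(list(onehot.columns))
print(code_thresholds, onehot_thresholds)
```

With the integer codes, the splits at 0.5 and 1.5 impose an ordering BMW < Toyota < Audi that the data does not have. With the one-hot columns, each split only asks "is it this car or not".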
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
>>>>>>>>
>>>>>>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels.
>>>>>>>>
>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.
>>>>>>>>
>>>>>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does sklearn know internally whether 0 vs. 1 is categorical, not numerical?
>>>>>>>>
>>>>>>>> In R for instance, you do as.factor(), which explicitly states the data type.
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>>>>>
>>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>>>>>>>>> Thanks, Guillaume.
>>>>>>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>>>>>
>>>>>>>>> It would help us to know which part of the pipeline is tedious to set up, so we can see whether something can be improved there.
>>>>>>>>> Do you mean that you would like to automatically detect the type of each feature (categorical/numerical) and apply a
>>>>>>>>> default encoder/scaler, as discussed here: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>>>>>
>>>>>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box,
>>>>>>>>> which might be dangerous.
>>>>>>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>>>>>> Which basically does that.
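For the explicit (non-automatic) route, a minimal sketch of a ColumnTransformer that one-hot encodes the categorical columns and passes the continuous one through unchanged, assuming pandas and scikit-learn >= 0.20. The column names follow the example table in this thread; the fourth row and the step name "onehot" are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Mixed continuous/categorical toy data (fourth row is made up).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [30, 35, 50, 41],
    "Car": ["BMW", "Toyota", "Audi", "Toyota"],
    "Income": [10000, 9000, 12000, 9500],
})
X = df[["Gender", "Age", "Car"]]
y = df["Income"]

# One-hot encode only the categorical columns; "Age" is passed through as is.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Car"]),
    ],
    remainder="passthrough",
)

# The tree then receives purely numerical input.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Listing the categorical columns by hand is the tedious part raised above, but it keeps the decision about what is nominal and what is numerical explicit instead of guessed.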
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Javier,
>>>>>>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this:
>>>>>>>>>
>>>>>>>>> Gender  Date       Income  Car     Attendance
>>>>>>>>> Male    2019/3/01  10000   BMW     Yes
>>>>>>>>> Female  2019/5/02  9000    Toyota  No
>>>>>>>>> Male    2019/7/15  12000   Audi    Yes
>>>>>>>>>
>>>>>>>>> I am predicting income using all other categorical variables. Maybe it is catboost!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> M
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited,
>>>>>>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look
>>>>>>>>> at catboost [1], which has an sklearn-compatible API.
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> [1] https://catboost.ai/
>>>>>>>>>
>>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>>>>>>>>> Hello all,
>>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
>>>>>>>>>
>>>>>>>>> For example,
>>>>>>>>> Gender  Age  Income  Car     Attendance
>>>>>>>>> Male    30   10000   BMW     Yes
>>>>>>>>> Female  35   9000    Toyota  No
>>>>>>>>> Male    50   12000   Audi    Yes
>>>>>>>>>
>>>>>>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it cannot!
>>>>>>>>>
>>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>>>>>
>>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>>>>>
>>>>>>>>> Thank you very much!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> scikit-learn mailing list
>>>>>>>>> scikit-learn at python.org
>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Lemaitre
>>>>>>>>> INRIA Saclay - Parietal team
>>>>>>>>> Center for Data Science Paris-Saclay
>>>>>>>>> https://glemaitre.github.io/
>>>>>>>>>