<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Like Nicolas said, the 0.5 threshold is just a workaround, but it does the right thing for the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. That is, the tree effectively uses the following conversion:<div class=""><br class=""></div><div class="">treat as car_Audi=1 if car_Audi &gt;= 0.5</div><div class="">treat as car_Audi=0 if car_Audi &lt; 0.5</div><div class=""><br class=""></div><div class="">or, it may be</div><div class=""><br class=""></div><div class=""><div class="">treat as car_Audi=1 if car_Audi &gt; 0.5</div><div class="">treat as car_Audi=0 if car_Audi &lt;= 0.5</div></div><div class=""><br class=""></div><div class="">(I forgot which one sklearn is using, but either way, it will be fine.)</div><div class=""><br class=""></div><div class="">Best,</div><div class="">Sebastian</div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Oct 4, 2019, at 1:44 PM, Nicolas Hug &lt;<a href="mailto:niourf@gmail.com" class="">niourf@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" class="">
<div class=""><div class="">
<br class="webkit-block-placeholder"></div><blockquote type="cite" class="">But the decision tree is still treating the
one-hot encoding as numerical input and splitting at 0.5. This is
not right. Perhaps I'm doing something wrong?</blockquote><div class=""><br class="webkit-block-placeholder"></div><p class="">You're not doing anything wrong, and neither is the tree. Trees
don't support categorical variables in sklearn, so everything is
treated as numerical.</p><p class="">This is why we do one-hot-encoding: so that a set of numerical
(one hot encoded) features can be treated as if they were just one
categorical feature.</p><p class=""><br class="">
</p><p class="">Nicolas<br class="">
</p>
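<p class="">As a minimal sketch of the point above (the toy values and the use of <code>pandas.get_dummies</code> are illustrative assumptions, not taken from the original data), one can one-hot encode a car column, fit a regression tree, and inspect where it splits:</p>
<pre class="">
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data loosely modeled on the table in this thread (values are made up).
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "income": [10000, 9000, 12000, 10500, 9200, 11800],
})

# One 0/1 column per category: car_Audi, car_BMW, car_Toyota.
X = pd.get_dummies(df["car"], prefix="car").astype(float)
y = df["income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Leaves have children_left == -1; every other node is a split.
is_split = tree.tree_.children_left != -1
split_thresholds = tree.tree_.threshold[is_split]
split_features = X.columns[tree.tree_.feature[is_split]]

for name, thr in zip(split_features, split_thresholds):
    print(f"{name}: threshold {thr}")
```
</pre>
<p class="">Each 0/1 column is split at the midpoint 0.5. scikit-learn sends samples with a feature value &lt;= threshold to the left child, so car_Audi=0 goes left and car_Audi=1 goes right, which is exactly the categorical test "is it an Audi?".</p>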
<div class="moz-cite-prefix">On 10/4/19 2:01 PM, C W wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:CAE2FW2kFS0KdCWMkAdKcqd_hiHGe98HKvvgjx24H4dsF05iJxQ@mail.gmail.com" class="">
<meta http-equiv="content-type" content="text/html; charset=UTF-8" class="">
<div dir="ltr" class="">
<div class="">Yes, you are right. it was 0.5 and 0.5 for split, not 1.5.
So, typo on my part.<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Looks like I did one-hot-encoding correctly. My new
variable names are: car_Audi, car_BMW, etc.<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">But, decision tree is still mistaking one-hot-encoding as
numerical input and split at 0.5. This is not right. Perhaps,
I'm doing something wrong?<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Is there a good toy example on the sklearn website? I am
only see this: <a href="https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html" moz-do-not-send="true" class="">https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html</a>.</div>
<div class=""><br class="">
</div>
<div class="">Thanks!<br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Oct 4, 2019 at 1:28 PM
Sebastian Raschka <<a href="mailto:mail@sebastianraschka.com" moz-do-not-send="true" class="">mail@sebastianraschka.com</a>>
wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" class="">Hi,
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="">The funny part is: the tree is taking
one-hot-encoding (BMW=0, Toyota=1, Audi=2) as
numerical values, not category.The tree splits at
0.5 and 1.5</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
That's not a one-hot encoding, then.</div>
<div class=""><br class="">
</div>
<div class="">For an Audi datapoint, it should be</div>
<div class=""><br class="">
</div>
<div class="">BMW=0</div>
<div class="">Toyota=0</div>
<div class="">Audi=1</div>
<div class=""><br class="">
</div>
<div class="">for BMW</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">BMW=1</div>
<div class="">Toyota=0</div>
<div class="">Audi=0</div>
</div>
<div class=""><br class="">
</div>
<div class="">and for Toyota</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">BMW=0</div>
<div class="">Toyota=1</div>
<div class="">Audi=0</div>
</div>
<div class=""><br class="">
</div>
<div class="">The split threshold should then be at 0.5 for any of
these features.</div>
<div class=""><br class="">
</div>
<div class="">Based on your email, I think you were assuming that the
DT does the one-hot encoding internally, which it doesn't.
In practice, it is hard to guess what is a nominal and
what is a ordinal variable, so you have to do the onehot
encoding before you give the data to the decision tree.</div>
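<div class=""><br class="">
</div>
<div class="">To make the difference concrete, here is a small sketch (the income values are made up for illustration) that fits the same tree once on the integer codes BMW=0, Toyota=1, Audi=2 and once on a proper one-hot encoding produced with <code>OneHotEncoder</code>:</div>
<pre class="">
```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Integer codes from the thread: BMW=0, Toyota=1, Audi=2 (one single column).
codes = np.array([[0], [1], [2], [0], [1], [2]], dtype=float)
income = np.array([10000, 9000, 12000, 10000, 9000, 12000], dtype=float)


def split_thresholds(fitted_tree):
    """Return the sorted thresholds of all split (non-leaf) nodes."""
    is_split = fitted_tree.tree_.children_left != -1
    return sorted(fitted_tree.tree_.threshold[is_split])


# Integer codes: the tree treats the column as an ordered number.
code_tree = DecisionTreeRegressor(random_state=0).fit(codes, income)
code_thresholds = split_thresholds(code_tree)

# One-hot encoding: one 0/1 column per car brand.
onehot = OneHotEncoder().fit_transform(codes).toarray()
onehot_tree = DecisionTreeRegressor(random_state=0).fit(onehot, income)
onehot_thresholds = split_thresholds(onehot_tree)

print("integer codes:", code_thresholds)
print("one-hot:", onehot_thresholds)
```
</pre>
<div class="">With the integer codes the tree cuts the single column at 0.5 and 1.5, which imposes an order on the brands; with the one-hot columns every split is at 0.5.</div>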
<div class=""><br class="">
</div>
<div class="">Best,</div>
<div class="">Sebastian</div>
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Oct 4, 2019, at 11:48 AM, C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
wrote:</div>
<br class="">
<div class="">
<div dir="ltr" class="">
<div class="">I'm getting some funny results. I am doing a
regression decision tree, the response variables
are assigned to levels.<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">The funny part is: the tree is taking
one-hot-encoding (BMW=0, Toyota=1, Audi=2) as
numerical values, not category.</div>
<div class=""><br class="">
</div>
<div class="">The tree splits at 0.5 and 1.5. Am I doing
one-hot-encoding wrong? How does the sklearn
know internally 0 vs. 1 is categorical, not
numerical? <br class="">
</div>
<div class=""><br class="">
</div>
<div class="">In R for instance, you do as.factor(), which
explicitly states the data type.</div>
<div class=""><br class="">
</div>
<div class="">Thank you!</div>
<div class=""><br class="">
</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Sep 18,
2019 at 11:13 AM Andreas Mueller <<a href="mailto:t3kcit@gmail.com" target="_blank" moz-do-not-send="true" class="">t3kcit@gmail.com</a>>
wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF" class=""> <br class="">
<br class="">
<div class="">On 9/15/19 8:16 AM, Guillaume Lemaître
wrote:<br class="">
</div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div dir="ltr" class=""><br class="">
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sat, 14 Sep 2019 at 20:59, C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr" class="">Thanks, Guillaume.
<div class="">Column transformer looks pretty
neat. I've also heard though, this
pipeline can be tedious to set up?
Specifying what you want for every
feature is a pain.</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">It would be interesting for us
which part of the pipeline is tedious
to set up to know if we can improve
something there.</div>
<div class="">Do you mean, that you would like to
automatically detect of which type of
feature (categorical/numerical) and
apply a</div>
<div class="">default encoder/scaling such as
discuss there: <a href="https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127" target="_blank" moz-do-not-send="true" class="">https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127</a></div>
<div class=""><br class="">
</div>
<div class="">IMO, one a user perspective, it
would be cleaner in some cases at the
cost of applying blindly a black box</div>
<div class="">which might be dangerous.<br class="">
</div>
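<div class=""><br class="">
</div>
<div class="">For illustration only, here is a sketch of what such type-based default preprocessing could look like with a column transformer. It assumes a scikit-learn version that provides <code>make_column_selector</code>, and it reuses the toy table from the first message; the choice to one-hot encode every non-numeric column is an assumption, not a recommendation:</div>
<pre class="">
```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy table from the first message in this thread.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})
X = df.drop(columns="Income")
y = df["Income"]

# One-hot encode every non-numeric column; pass numeric columns through as-is.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude="number")),
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```
</pre>
<div class="">Selecting by dtype avoids listing every column by hand, but it silently treats any integer-coded category as numeric, which is the kind of black-box risk mentioned above.</div>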
</div>
</div>
</blockquote>
Also see <a href="https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor" target="_blank" moz-do-not-send="true" class="">https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor</a><br class="">
Which basically does that.<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="gmail_quote">
<div class=""> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr" class="">
<div class=""><br class="">
</div>
<div class="">Jaiver,</div>
<div class="">Actually, you guessed right. My
real data has only one numerical
variable, looks more like this:</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">Gender Date
Income Car Attendance<br class="">
</div>
<div class="">Male 2019/3/01 10000
BMW Yes<br class="">
</div>
<div class="">Female 2019/5/02 9000
Toyota No<br class="">
</div>
<div class="">Male 2019/7/15 12000
Audi Yes</div>
</div>
<div class=""><br class="">
</div>
<div class="">I am predicting income using
all other categorical variables.
Maybe it is catboost!</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class=""><br class="">
</div>
<div class="">M</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Sat, Sep 14, 2019 at 9:25 AM
Javier López &lt;<a href="mailto:jlopez@ende.cc" target="_blank" moz-do-not-send="true" class="">jlopez@ende.cc</a>&gt;
wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr" class="">If you have
datasets with many categorical
features, and perhaps many
categories, the tools in sklearn
are quite limited,
<div class="">but there are alternative
implementations of boosted
trees that are designed with
categorical features in mind.
Take a look</div>
<div class="">at catboost [1], which has
an sklearn-compatible API.</div>
<div class=""><br class="">
</div>
<div class="">J</div>
<div class=""><br class="">
</div>
<div class="">[1] <a href="https://catboost.ai/" target="_blank" moz-do-not-send="true" class="">https://catboost.ai/</a></div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Sep
14, 2019 at 3:40 AM C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr" class="">
<div class="">Hello all,</div>
<div class="">I'm very confused. Can
the decision tree module
handle both continuous and
categorical features in
the dataset? In this case,
it's just CART
(Classification and
Regression Trees).<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">For example,</div>
<div class="">Gender Age Income
Car Attendance<br class="">
</div>
<div class="">Male 30 10000
BMW Yes<br class="">
</div>
<div class="">Female 35 9000
Toyota No<br class="">
</div>
<div class="">Male 50 12000
Audi Yes<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">According to the
documentation <a href="https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart" target="_blank" moz-do-not-send="true" class="">https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart</a>,
it can not! <br class="">
</div>
<div class=""><br class="">
</div>
<div class="">It says: "scikit-learn
implementation does not
support categorical
variables for now". <br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Is this true? If not,
can someone point me to an
example? If yes, what do
people do?<br class="">
</div>
<div class=""><br class="">
</div>
<div class="">Thank you very much!<br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
</div>
_______________________________________________<br class="">
scikit-learn mailing list<br class="">
<a href="mailto:scikit-learn@python.org" target="_blank" moz-do-not-send="true" class="">scikit-learn@python.org</a><br class="">
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank" moz-do-not-send="true" class="">https://mail.python.org/mailman/listinfo/scikit-learn</a><br class="">
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all" class="">
<br class="">
-- <br class="">
<div dir="ltr" class="">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">Guillaume Lemaitre<br class="">
INRIA Saclay - Parietal team<br class="">
Center for Data Science
Paris-Saclay<br class="">
<a href="https://glemaitre.github.io/" target="_blank" moz-do-not-send="true" class="">https://glemaitre.github.io/</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br class="">
</blockquote>
<br class="">
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</blockquote>
</div>
</div></blockquote></div><br class=""></div></body></html>