<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">As Nicolas said, the 0.5 is just a workaround, but it does the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables. That is, the tree effectively applies the following conversion:<div class=""><br class=""></div><div class="">treat as car_Audi=1 if car_Audi >= 0.5</div><div class="">treat as car_Audi=0 if car_Audi < 0.5</div><div class=""><br class=""></div><div class="">or, it may be</div><div class=""><br class=""></div><div class=""><div class="">treat as car_Audi=1 if car_Audi > 0.5</div><div class="">treat as car_Audi=0 if car_Audi <= 0.5</div></div><div class=""><br class=""></div><div class="">(I forgot which one sklearn uses, but either way, it will be fine: a one-hot value is only ever 0 or 1, so both rules give the same result.)</div><div class=""><br class=""></div><div class="">Best,</div><div class="">Sebastian</div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Oct 4, 2019, at 1:44 PM, Nicolas Hug <<a href="mailto:niourf@gmail.com" class="">niourf@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class="">
  
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" class="">
  
  <div class=""><div class="">
      <br class="webkit-block-placeholder"></div><blockquote type="cite" class="">But the decision tree is still treating the
        one-hot encoding as numerical input and splitting at 0.5. This is
        not right. Perhaps I'm doing something wrong?</blockquote><div class=""><br class="webkit-block-placeholder"></div><p class="">You're not doing anything wrong, and neither is the tree. Trees
      in sklearn don't support categorical variables, so everything is
      treated as numerical.</p><p class="">This is why we do one-hot encoding: so that a set of numerical
      (one-hot encoded) features can be treated as if they were just one
      categorical feature.</p><p class=""><br class="">
    </p><p class="">Nicolas<br class="">
    </p>
    <div class="moz-cite-prefix">On 10/4/19 2:01 PM, C W wrote:<br class="">
    </div>
    <blockquote type="cite" cite="mid:CAE2FW2kFS0KdCWMkAdKcqd_hiHGe98HKvvgjx24H4dsF05iJxQ@mail.gmail.com" class="">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8" class="">
      <div dir="ltr" class="">
        <div class="">Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5.
          So, a typo on my part.<br class="">
        </div>
        <div class=""><br class="">
        </div>
        <div class="">Looks like I did one-hot-encoding correctly. My new
          variable names are: car_Audi, car_BMW, etc.<br class="">
        </div>
        <div class=""><br class="">
        </div>
        <div class="">But the decision tree is still treating the one-hot encoding as
          numerical input and splitting at 0.5. This is not right. Perhaps
          I'm doing something wrong?<br class="">
        </div>
        <div class=""><br class="">
        </div>
        <div class="">Is there a good toy example on the sklearn website? I
          only see this: <a href="https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html" moz-do-not-send="true" class="">https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html</a>.</div>
        <div class=""><br class="">
        </div>
        <div class="">Thanks!<br class="">
        </div>
        <div class=""><br class="">
        </div>
        <div class=""><br class="">
        </div>
      </div>
      <br class="">
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Fri, Oct 4, 2019 at 1:28 PM
          Sebastian Raschka <<a href="mailto:mail@sebastianraschka.com" moz-do-not-send="true" class="">mail@sebastianraschka.com</a>>
          wrote:<br class="">
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div style="overflow-wrap: break-word;" class="">Hi,
            <div class=""><br class="">
            </div>
            <div class="">
              <blockquote type="cite" class="">
                <div dir="ltr" class="">
                  <div class="">The funny part is: the tree is taking
                    one-hot encoding (BMW=0, Toyota=1, Audi=2) as
                    numerical values, not categories. The tree splits at
                    0.5 and 1.5.</div>
                </div>
              </blockquote>
              <div class=""><br class="">
              </div>
              That's not a one-hot encoding then.</div>
            <div class=""><br class="">
            </div>
            <div class="">For an Audi datapoint, it should be</div>
            <div class=""><br class="">
            </div>
            <div class="">BMW=0</div>
            <div class="">Toyota=0</div>
            <div class="">Audi=1</div>
            <div class=""><br class="">
            </div>
            <div class="">for BMW</div>
            <div class=""><br class="">
            </div>
            <div class="">
              <div class="">BMW=1</div>
              <div class="">Toyota=0</div>
              <div class="">Audi=0</div>
            </div>
            <div class=""><br class="">
            </div>
            <div class="">and for Toyota</div>
            <div class=""><br class="">
            </div>
            <div class="">
              <div class="">BMW=0</div>
              <div class="">Toyota=1</div>
              <div class="">Audi=0</div>
            </div>
            <div class=""><br class="">
            </div>
            <div class="">The split threshold should then be at 0.5 for any of
              these features.</div>
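            <div class=""><br class="">
            </div>
            <div class="">A minimal sketch of this, assuming toy data modelled on the table in this thread (the income values, the column names, and the use of pandas.get_dummies are illustrative assumptions): fit a tree on one-hot columns and inspect the thresholds it learned. In scikit-learn, samples whose feature value is less than or equal to the threshold go to the left child, so a threshold of 0.5 separates 0 from 1 exactly.</div>

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data for illustration only; the incomes are made up.
df = pd.DataFrame({
    "Car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "Income": [10000, 9000, 12000, 10500, 9200, 11800],
})

# One 0/1 column per category: Car_Audi, Car_BMW, Car_Toyota.
X = pd.get_dummies(df[["Car"]], dtype=float)
y = df["Income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Internal (split) nodes store a feature index >= 0; leaves store a negative marker.
is_split = tree.tree_.feature >= 0
split_thresholds = tree.tree_.threshold[is_split]

# Every split on a 0/1 column lands halfway between the two observed values.
print(np.unique(split_thresholds))
```

            <div class="">Reading a node such as "Car_Audi &lt;= 0.5" as "the car is not an Audi" is then the intended interpretation of the numerical split.</div>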
            <div class=""><br class="">
            </div>
            <div class="">Based on your email, I think you were assuming that the
              DT does the one-hot encoding internally, which it doesn't.
              In practice, it is hard to guess what is a nominal and
              what is an ordinal variable, so you have to do the one-hot
              encoding before you give the data to the decision tree.</div>
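            <div class=""><br class="">
            </div>
            <div class="">One way to do the encoding before the tree sees the data is to put it in a pipeline. The following is only a sketch under assumptions: the data frame, the choice of which columns are nominal, and the step names are made up for illustration, and other setups (for example pandas.get_dummies) work as well.</div>

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Car": ["BMW", "Toyota", "Audi", "BMW"],
    "Attendance": ["Yes", "No", "Yes", "No"],
    "Income": [10000, 9000, 12000, 9500],
})

# The user, not the tree, declares which columns are nominal.
nominal = ["Gender", "Car", "Attendance"]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), nominal)]
)
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])
model.fit(df[nominal], df["Income"])

# New rows go through the same encoding before reaching the tree.
new_rows = pd.DataFrame(
    {"Gender": ["Male"], "Car": ["Audi"], "Attendance": ["Yes"]}
)
prediction = model.predict(new_rows)
print(prediction)
```

            <div class="">Keeping the encoder inside the pipeline means the same category-to-column mapping is reused at prediction time.</div>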
            <div class=""><br class="">
            </div>
            <div class="">Best,</div>
            <div class="">Sebastian</div>
            <div class="">
              <div class=""><br class="">
                <blockquote type="cite" class="">
                  <div class="">On Oct 4, 2019, at 11:48 AM, C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
                    wrote:</div>
                  <br class="">
                  <div class="">
                    <div dir="ltr" class="">
                      <div class="">I'm getting some funny results. I am fitting a
                        regression decision tree; the response variables
                        are assigned to levels.<br class="">
                      </div>
                      <div class=""><br class="">
                      </div>
                      <div class="">The funny part is: the tree is taking
                        one-hot encoding (BMW=0, Toyota=1, Audi=2) as
                        numerical values, not categories.</div>
                      <div class=""><br class="">
                      </div>
                      <div class="">The tree splits at 0.5 and 1.5. Am I doing
                        one-hot encoding wrong? How does sklearn
                        know internally that 0 vs. 1 is categorical, not
                        numerical? <br class="">
                      </div>
                      <div class=""><br class="">
                      </div>
                      <div class="">In R for instance, you do as.factor(), which
                        explicitly states the data type.</div>
                      <div class=""><br class="">
                      </div>
                      <div class="">Thank you!</div>
                      <div class=""><br class="">
                      </div>
                    </div>
                    <br class="">
                    <div class="gmail_quote">
                      <div dir="ltr" class="gmail_attr">On Wed, Sep 18,
                        2019 at 11:13 AM Andreas Mueller <<a href="mailto:t3kcit@gmail.com" target="_blank" moz-do-not-send="true" class="">t3kcit@gmail.com</a>>
                        wrote:<br class="">
                      </div>
                      <blockquote class="gmail_quote" style="margin:0px
                        0px 0px 0.8ex;border-left:1px solid
                        rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF" class=""> <br class="">
                          <br class="">
                          <div class="">On 9/15/19 8:16 AM, Guillaume Lemaître
                            wrote:<br class="">
                          </div>
                          <blockquote type="cite" class="">
                            <div dir="ltr" class="">
                              <div dir="ltr" class=""><br class="">
                              </div>
                              <br class="">
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Sat, 14 Sep 2019 at 20:59, C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
                                  wrote:<br class="">
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                  0.8ex;border-left:1px solid
                                  rgb(204,204,204);padding-left:1ex">
                                  <div dir="ltr" class="">Thanks, Guillaume. 
                                    <div class="">Column transformer looks pretty
                                      neat. I've also heard, though, that this
                                      pipeline can be tedious to set up?
                                      Specifying what you want for every
                                      feature is a pain.</div>
                                  </div>
                                </blockquote>
                                <div class=""><br class="">
                                </div>
                                <div class="">It would be interesting for us to know
                                  which part of the pipeline is tedious
                                  to set up, to see if we can improve
                                  something there.</div>
                                <div class="">Do you mean that you would like to
                                  automatically detect the type of each
                                  feature (categorical/numerical) and
                                  apply a</div>
                                <div class="">default encoder/scaling, as
                                  discussed here: <a href="https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127" target="_blank" moz-do-not-send="true" class="">https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127</a></div>
                                <div class=""><br class="">
                                </div>
                                <div class="">IMO, from a user perspective, it
                                  would be cleaner in some cases, at the
                                  cost of blindly applying a black box</div>
                                <div class="">which might be dangerous.<br class="">
                                </div>
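                                <div class=""><br class="">
                                </div>
                                <div class="">To make the idea of automatic detection concrete, here is a sketch that selects columns by dtype with make_column_selector, which is available in recent scikit-learn releases. The toy data and the choice of StandardScaler and OneHotEncoder as defaults are assumptions for illustration. Note that dtype-based detection cannot tell an integer-coded category from a true numerical feature, which is one way a blind default can go wrong.</div>

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Car": ["BMW", "Toyota", "Audi"],
})

# Selectors are callables that return the matching column names of a frame.
select_numeric = make_column_selector(dtype_include=np.number)
select_other = make_column_selector(dtype_exclude=np.number)

# Scale numerical columns, one-hot encode everything else.
preprocess = make_column_transformer(
    (StandardScaler(), select_numeric),
    (OneHotEncoder(handle_unknown="ignore"), select_other),
)
Xt = preprocess.fit_transform(df)

print(select_numeric(df), select_other(df), Xt.shape)
```

                                <div class="">Here Age is scaled into one column, while Gender and Car expand into two and three one-hot columns respectively.</div>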
                              </div>
                            </div>
                          </blockquote>
                          Also see <a href="https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor" target="_blank" moz-do-not-send="true" class="">https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor</a><br class="">
                          which basically does that.<br class="">
                          <br class="">
                          <br class="">
                          <blockquote type="cite" class="">
                            <div dir="ltr" class="">
                              <div class="gmail_quote">
                                <div class=""> </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                  0.8ex;border-left:1px solid
                                  rgb(204,204,204);padding-left:1ex">
                                  <div dir="ltr" class="">
                                    <div class=""><br class="">
                                    </div>
                                    <div class="">Javier,</div>
                                    <div class="">Actually, you guessed right. My
                                      real data has only one numerical
                                      variable, looks more like this:</div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class="">
                                      <div class="">Gender Date           
                                        Income  Car   Attendance<br class="">
                                      </div>
                                      <div class="">Male     2019/3/01   10000  
                                        BMW          Yes<br class="">
                                      </div>
                                      <div class="">Female 2019/5/02    9000 
                                         Toyota          No<br class="">
                                      </div>
                                      <div class="">Male     2019/7/15   12000   
                                        Audi           Yes</div>
                                    </div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class="">I am predicting income using
                                      all other categorical variables.
                                      Maybe it is catboost!</div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class="">Thanks,</div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class="">M</div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class=""><br class="">
                                    </div>
                                    <div class=""><br class="">
                                      <div class=""><br class="">
                                      </div>
                                      <div class=""><br class="">
                                      </div>
                                    </div>
                                  </div>
                                  <br class="">
                                  <div class="gmail_quote">
                                    <div dir="ltr" class="gmail_attr">On
                                      Sat, Sep 14, 2019 at 9:25 AM
                                      Javier López <a href="mailto:jlopez@ende.cc" target="_blank" moz-do-not-send="true" class=""><jlopez@ende.cc></a>
                                      wrote:<br class="">
                                    </div>
                                    <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                      0.8ex;border-left:1px solid
                                      rgb(204,204,204);padding-left:1ex">
                                      <div dir="ltr" class="">If you have
                                        datasets with many categorical
                                        features, and perhaps many
                                        categories, the tools in sklearn
                                        are quite limited, 
                                        <div class="">but there are alternative
                                          implementations of boosted
                                          trees that are designed with
                                          categorical features in mind.
                                          Take a look</div>
                                        <div class="">at catboost [1], which has
                                          an sklearn-compatible API.</div>
                                        <div class=""><br class="">
                                        </div>
                                        <div class="">J</div>
                                        <div class=""><br class="">
                                        </div>
                                        <div class="">[1] <a href="https://catboost.ai/" target="_blank" moz-do-not-send="true" class="">https://catboost.ai/</a></div>
                                      </div>
                                      <br class="">
                                      <div class="gmail_quote">
                                        <div dir="ltr" class="gmail_attr">On Sat, Sep
                                          14, 2019 at 3:40 AM C W <<a href="mailto:tmrsg11@gmail.com" target="_blank" moz-do-not-send="true" class="">tmrsg11@gmail.com</a>>
                                          wrote:<br class="">
                                        </div>
                                        <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                          0.8ex;border-left:1px solid
                                          rgb(204,204,204);padding-left:1ex">
                                          <div dir="ltr" class="">
                                            <div class="">Hello all,</div>
                                            <div class="">I'm very confused. Can
                                              the decision tree module
                                              handle both continuous and
                                              categorical features in
                                              the dataset? In this case,
                                              it's just CART
                                              (Classification and
                                              Regression Trees).<br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class="">For example,</div>
                                            <div class="">Gender Age Income 
                                              Car   Attendance<br class="">
                                            </div>
                                            <div class="">Male     30   10000  
                                              BMW          Yes<br class="">
                                            </div>
                                            <div class="">Female 35     9000 
                                              Toyota          No<br class="">
                                            </div>
                                            <div class="">Male     50   12000   
                                              Audi           Yes<br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class="">According to the
                                              documentation <a href="https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart" target="_blank" moz-do-not-send="true" class="">https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart</a>,
                                              it cannot! <br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class="">It says: "scikit-learn
                                              implementation does not
                                              support categorical
                                              variables for now". <br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class="">Is this true? If not,
                                              can someone point me to an
                                              example? If yes, what do
                                              people do?<br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class="">Thank you very much!<br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                            <div class=""><br class="">
                                            </div>
                                          </div>
_______________________________________________<br class="">
                                          scikit-learn mailing list<br class="">
                                          <a href="mailto:scikit-learn@python.org" target="_blank" moz-do-not-send="true" class="">scikit-learn@python.org</a><br class="">
                                          <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank" moz-do-not-send="true" class="">https://mail.python.org/mailman/listinfo/scikit-learn</a><br class="">
                                        </blockquote>
                                      </div>
                                    </blockquote>
                                  </div>
                                </blockquote>
                              </div>
                              <br clear="all" class="">
                              <br class="">
                              -- <br class="">
                              <div dir="ltr" class="">
                                <div dir="ltr" class="">
                                  <div class="">
                                    <div dir="ltr" class="">
                                      <div class="">
                                        <div dir="ltr" class="">
                                          <div class="">Guillaume Lemaitre<br class="">
                                            INRIA Saclay - Parietal team<br class="">
                                            Center for Data Science
                                            Paris-Saclay<br class="">
                                            <a href="https://glemaitre.github.io/" target="_blank" moz-do-not-send="true" class="">https://glemaitre.github.io/</a></div>
                                        </div>
                                      </div>
                                    </div>
                                  </div>
                                </div>
                              </div>
                            </div>
                            <br class="">
                          </blockquote>
                          <br class="">
                        </div>
                      </blockquote>
                    </div>
                  </div>
                </blockquote>
              </div>
              <br class="">
            </div>
          </div>
        </blockquote>
      </div>
      <br class="">
    </blockquote>
  </div>

</div></blockquote></div><br class=""></div></body></html>