<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    Your contribution would be very welcome; I think the current work
    has stalled.<br>
    <br>
    <br>
    <div class="moz-cite-prefix">On 01/04/2018 10:02 AM, Julio Antonio
      Soto de Vicente wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es">
      <meta http-equiv="content-type" content="text/html; charset=utf-8">
      Hi Yang Li,
      <div><br>
      </div>
      <div>I have to agree with you. Bitsets and/or one-hot encoding are
        just hacks that should not be necessary for decision tree
        learners.</div>
      <div><br>
      </div>
      <div>There is some WIP on an implementation for natural handling
        of categorical features in trees: please take a look at <a
          href="https://github.com/scikit-learn/scikit-learn/pull/4899"
          moz-do-not-send="true">https://github.com/scikit-learn/scikit-learn/pull/4899</a><br>
        <br>
        Cheers!<br>
        <br>
        <div>--
          <div>Julio</div>
        </div>
        <div><br>
          On 4 Jan 2018, at 9:06, 李扬 <<a
            href="mailto:sky188133882@163.com" moz-do-not-send="true">sky188133882@163.com</a>>
          wrote:<br>
          <br>
        </div>
        <blockquote type="cite">
          <div>
            <div
              style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial">
              <div>Dear J.B.,</div>
              <div><br>
              </div>
              <div>Thanks for your advice!</div>
              <div><br>
              </div>
              <div>Yes, I have considered using bitstrings or sequence
                numbers, but the problem is the algorithm, not the
                representation of the categorical data.</div>
              <div>Take the regression tree as an example: the algorithm
                in sklearn searches each feature for a split value and
                chooses the best split by minimizing the impurity of
                the child nodes.</div>
              <div>However, finding a split value for a categorical
                feature is not very meaningful, even if you represent it
                as a continuous value, and the resulting split partly
                depends on how you order the values of the categorical
                feature, which is not very convincing.</div>
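              <div><br>
              </div>
              <div>To make the ordering point concrete, here is a
                minimal sketch in plain Python. It is not scikit-learn's
                actual splitter: the helper names sse and
                best_threshold_split, the toy data, and the use of the
                sum of squared errors as the impurity are all
                assumptions made for illustration. The same categorical
                feature is integer-coded in two different orders, and
                the best threshold split is searched on each coding.</div>
              <pre>
```python
def sse(ys):
    """Sum of squared deviations from the mean (impurity of one child node)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_threshold_split(codes, targets):
    """Return (threshold, impurity) for the best 'code up to t' vs. 'code above t' split."""
    best = None
    # The largest code is skipped because it would leave the right child empty.
    for t in sorted(set(codes))[:-1]:
        left = [y for c, y in zip(codes, targets) if t >= c]
        right = [y for c, y in zip(codes, targets) if c > t]
        impurity = sse(left) + sse(right)
        if best is None or best[1] > impurity:
            best = (t, impurity)
    return best


# Toy data: category "b" has clearly different targets from "a" and "c".
labels = ["a", "b", "c", "a", "b", "c"]
targets = [1.0, 5.0, 1.0, 1.2, 5.2, 0.8]

# Two arbitrary integer codings of the same three categories.
coding_1 = {"a": 0, "b": 1, "c": 2}
coding_2 = {"a": 0, "c": 1, "b": 2}

result_1 = best_threshold_split([coding_1[x] for x in labels], targets)
result_2 = best_threshold_split([coding_2[x] for x in labels], targets)
print("coding 1:", result_1)
print("coding 2:", result_2)
```
</pre>
              <div>With the first coding, "b" sits between "a" and "c",
                so no single threshold can isolate it and the best
                impurity stays high. With the second coding, "b" is at
                the end and one threshold separates it cleanly, even
                though the data are identical.</div>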
              <div>Instead, in the CART algorithm, <b>you should separate
                  each category of the feature from the others and
                  compute the impurity of the two resulting sets, then
                  choose the separation with the minimal impurity.</b></div>
              <div>Obviously, this separation cannot be carried out by
                the current algorithm, which simply applies the split
                method for continuous values.</div>
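              <div><br>
              </div>
              <div>The separation strategy described above can be
                sketched as follows. This is only an illustration under
                stated assumptions, not code from scikit-learn or from
                any CART implementation: the function names
                squared_error and best_one_vs_rest_split and the toy
                data are hypothetical, and the impurity is taken to be
                the sum of squared errors, as for a regression tree.</div>
              <pre>
```python
def squared_error(ys):
    """Sum of squared deviations from the mean; 0.0 for an empty set."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_one_vs_rest_split(categories, targets):
    """Return (category, impurity) for the best 'one category vs. all others' split."""
    impurities = {}
    for cat in dict.fromkeys(categories):  # distinct categories, first-seen order
        inside = [y for c, y in zip(categories, targets) if c == cat]
        outside = [y for c, y in zip(categories, targets) if c != cat]
        if not outside:
            continue  # only one category present, so there is nothing to split
        impurities[cat] = squared_error(inside) + squared_error(outside)
    if not impurities:
        return None, 0.0
    best = min(impurities, key=impurities.get)
    return best, impurities[best]


# Toy data: "green" samples have clearly higher targets than the others.
categories = ["red", "green", "blue", "green", "red", "blue"]
targets = [1.0, 5.0, 1.2, 5.2, 0.8, 1.0]

best_category, best_impurity = best_one_vs_rest_split(categories, targets)
print(best_category, best_impurity)
```
</pre>
              <div>On this toy data the search picks "green" against the
                rest, and the result does not depend on any ordering or
                numeric coding of the categories.</div>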
              <div><br>
              </div>
              <div>Another possible shortcoming is that categorical
                features cannot be properly visualized. When drawing the
                tree graph, it is hard to read any information from a
                categorical feature node if it is simply split on a
                numeric value.</div>
              <div><br>
              </div>
              <div>Thank you for your time!</div>
              <div>Best wishes.</div>
              <br>
              <br>
              <br>
              <br>
              <div style="position:relative;zoom:1">--<br>
                <div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="line-height: 23.8px;">With best wishes!</span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="line-height: 23.8px;"><br>
                    </span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-size: 18px;"><b><br>
                      </b></span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-size: 18px; font-family: 'Microsoft
                      Yahei';">李扬 </span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-family: 'Microsoft Yahei';">Shanghai Jiao Tong University, <span
                        style="font-family: 'Microsoft Yahei';
                        line-height: 23.7999992370605px;">School of Electronic Information and Electrical Engineering
                      </span></span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-family: 'Microsoft Yahei';">Phone: 18818212371</span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-family: 'Microsoft Yahei';">Address: 800 Dongchuan Road, Minhang District, Shanghai</span></div>
                  <div style="line-height: 23.7999992370605px;"><span
                      style="font-family: 'Microsoft Yahei';">Postal code: 200240</span></div>
                </div>
                <div><br>
                </div>
                <div>Yang Li  +86 188 1821 2371</div>
                <div><span style="line-height: 23.7999992370605px;">Shanghai
                    Jiao Tong University</span></div>
                <div>School of Electronic,Information and Electrical
                  Engineering F1203026</div>
                <div>800 Dongchuan Road, Minhang District, Shanghai
                  200240</div>
                <div><br>
                </div>
                <div><br>
                </div>
                <div> </div>
              </div>
              <br>
              At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <<a
                href="mailto:scikit-learn@python.org"
                moz-do-not-send="true">scikit-learn@python.org</a>>
              wrote:<br>
              <blockquote id="isReplyContent" style="PADDING-LEFT: 1ex;
                MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
                <div dir="ltr">
                  <div>
                    <div>
                      <div>Dear Yang Li,<br>
                        <br>
                        > Neither the classification tree nor the
                        regression tree supports categorical features.
                        That means the decision tree models can only
                        accept continuous features. <br>
                        <br>
                      </div>
                      Consider either manually encoding your categories
                      in bitstrings (e.g., "Facebook" = 001, "Twitter" =
                      010, "Google" = 100), or using OneHotEncoder to do
                      the same thing for you automatically.<br>
                      <br>
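                      As a rough illustration of the manual route, here
                      is a minimal plain-Python sketch. The helper name
                      bitstring_codes is hypothetical and is not part of
                      scikit-learn; it reproduces the example codes
                      above, and OneHotEncoder builds equivalent 0/1
                      columns automatically.<br>
                      <pre>
```python
def bitstring_codes(categories):
    """Map each distinct category to a one-hot bitstring, in first-seen order."""
    distinct = list(dict.fromkeys(categories))  # de-duplicate, keep order
    width = len(distinct)
    # Category number i gets a single 1 in position i, counted from the right.
    return {
        cat: format(2 ** i, "0{}b".format(width))
        for i, cat in enumerate(distinct)
    }


codes = bitstring_codes(["Facebook", "Twitter", "Google"])
print(codes)

# Turn a column of category labels into rows of 0/1 features for a tree.
column = ["Google", "Facebook", "Facebook", "Twitter"]
rows = [[int(bit) for bit in codes[label]] for label in column]
print(rows)
```
</pre>
                      Each category becomes its own 0/1 column, so the
                      tree can test membership in a single category
                      with an ordinary threshold split.<br>
                      <br>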
                    </div>
                    Cheers,<br>
                  </div>
                  J.B.<br>
                </div>
              </blockquote>
            </div>
            <br>
            <br>
            </div>
        </blockquote>
        <blockquote type="cite">
          <div><span>_______________________________________________</span><br>
            <span>scikit-learn mailing list</span><br>
            <span><a href="mailto:scikit-learn@python.org"
                moz-do-not-send="true">scikit-learn@python.org</a></span><br>
            <span><a
                href="https://mail.python.org/mailman/listinfo/scikit-learn"
                moz-do-not-send="true">https://mail.python.org/mailman/listinfo/scikit-learn</a></span><br>
          </div>
        </blockquote>
      </div>
      <br>
    </blockquote>
    <br>
  </body>
</html>