<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Your contribution would be very welcome; I think the current work

    has stalled.<br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 01/04/2018 10:02 AM, Julio Antonio

      Soto de Vicente wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es">


      Hi Yang Li,

      <div><br>

      </div>

      <div>I have to agree with you. Bitsets and/or one-hot encoding are

        just hacks that should not be necessary for decision tree

        learners.</div>

      <div><br>

      </div>

      <div>There is some WIP on an implementation for natural handling

        of categorical features in trees: please take a look at <a

          href="https://github.com/scikit-learn/scikit-learn/pull/4899"

          moz-do-not-send="true">https://github.com/scikit-learn/scikit-learn/pull/4899</a><br>

        <br>

        Cheers!<br>

        <br>

        <div>--

          <div>Julio</div>

        </div>

        <div><br>

          On 4 Jan 2018, at 9:06, 李扬 &lt;<a

            href="mailto:sky188133882@163.com" moz-do-not-send="true">sky188133882@163.com</a>&gt;

          wrote:<br>

          <br>

        </div>

        <blockquote type="cite">

          <div>

            <div

              style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial">

              <div>Dear J.B.,</div>

              <div><br>

              </div>

              <div>Thanks for your advice!</div>

              <div><br>

              </div>

              <div>Yes, I have considered using bitstrings or sequence

                numbers, but the problem is the algorithm, not the

                representation of the categorical data.</div>

              <div>Take the regression tree as an example: the algorithm

                in sklearn looks for a split value of the feature and

                picks the best split by minimizing the impurity of the

                child nodes.</div>

              <div>However, finding a split value for a categorical feature

                is not very meaningful even if you represent it as a

                continuous value, and the resulting split partially

                depends on how you order the values of the categorical

                feature, which is not very convincing.</div>
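              <div>To make the ordering issue concrete, here is a minimal
                sketch in plain Python. It is not sklearn's splitter: the
                helper name best_threshold_split, the toy targets and the
                two integer codings are all hypothetical, and impurity is
                assumed to be the size-weighted variance of the two child
                nodes. The same three categories give a different best
                split depending only on how they are numbered.</div>
              <pre>
```python
from statistics import pvariance


def best_threshold_split(codes, targets):
    """Return (threshold, impurity) for the best single-threshold split.

    Samples whose code is at or below the threshold go to the left child.
    Impurity is the size-weighted variance of the two children.
    """
    n = len(codes)
    levels = sorted(set(codes))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct codes.
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2
        left = [t for c, t in zip(codes, targets) if threshold >= c]
        right = [t for c, t in zip(codes, targets) if c > threshold]
        impurity = (len(left) * pvariance(left)
                    + len(right) * pvariance(right)) / n
        if best is None or best[1] > impurity:
            best = (threshold, impurity)
    return best


# Hypothetical toy data: "Twitter" rows have target 10, the others 1.
categories = ["Facebook", "Twitter", "Google", "Facebook", "Twitter", "Google"]
targets = [1.0, 10.0, 1.0, 1.0, 10.0, 1.0]

coding_a = {"Facebook": 0, "Twitter": 1, "Google": 2}  # Twitter in the middle
coding_b = {"Facebook": 0, "Google": 1, "Twitter": 2}  # Twitter at the end

split_a = best_threshold_split([coding_a[c] for c in categories], targets)
split_b = best_threshold_split([coding_b[c] for c in categories], targets)
print(split_a, split_b)
```
</pre>
              <div>With these toy values, the coding that puts "Twitter" in
                the middle cannot isolate it with one threshold (best
                impurity 13.5), while the coding that puts it at the end
                can (impurity 0).</div>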

              <div>Instead, in the CART algorithm, <b>you should separate

                  each category in the feature from the others and compute

                  the impurity of the two resulting sets, then choose the

                  separation with the minimal impurity.</b></div>

              <div>Obviously, this separation process cannot be carried out

                by the current algorithm, which simply uses the split method

                for continuous values.</div>
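              <div>The separation described above can be sketched roughly
                as follows, again in plain Python with hypothetical names
                (best_one_vs_rest_split) and made-up toy data, assuming
                size-weighted variance as the impurity. It only tries one
                category against all the others, as described here, and
                does not search larger subsets of categories.</div>
              <pre>
```python
from statistics import pvariance


def best_one_vs_rest_split(categories, targets):
    """Return (category, impurity) for the best "one category vs. rest" split.

    Impurity is the size-weighted variance of the two resulting sets.
    """
    n = len(targets)
    best = None
    for category in sorted(set(categories)):
        inside = [t for c, t in zip(categories, targets) if c == category]
        rest = [t for c, t in zip(categories, targets) if c != category]
        if not rest:
            continue  # only one category present, nothing to separate
        impurity = (len(inside) * pvariance(inside)
                    + len(rest) * pvariance(rest)) / n
        if best is None or best[1] > impurity:
            best = (category, impurity)
    return best


# Hypothetical toy data: "Twitter" rows have target 10, the others 1.
sites = ["Facebook", "Twitter", "Google", "Facebook", "Twitter", "Google"]
values = [1.0, 10.0, 1.0, 1.0, 10.0, 1.0]

best_split = best_one_vs_rest_split(sites, values)
print(best_split)
```
</pre>
              <div>Because the categories are compared by equality only,
                the result does not depend on any numbering of the
                categories.</div>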

              <div><br>

              </div>

              <div>One more possible shortcoming is that a categorical

                feature cannot be properly visualized. When drawing the

                tree graph, it is hard to get information from a

                categorical feature node if you just split it on a value.</div>

              <div><br>

              </div>

              <div>Thank you for your time!</div>

              <div>Best wishes.</div>

              <br>

              <br>

              <br>

              <br>

              <div style="position:relative;zoom:1">--<br>

                <div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="line-height: 23.8px;">With best wishes,</span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="line-height: 23.8px;"><br>

                    </span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-size: 18px;"><b><br>

                      </b></span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-size: 18px; font-family: 'Microsoft

                      Yahei';">李扬 </span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-family: 'Microsoft Yahei';">Shanghai Jiao Tong University, <span

                        style="font-family: 'Microsoft Yahei';

                        line-height: 23.7999992370605px;">School of Electronic Information and Electrical Engineering

                         </span></span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-family: 'Microsoft Yahei';">Tel: 18818212371</span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-family: 'Microsoft Yahei';">Address: 800 Dongchuan Road, Minhang District, Shanghai</span></div>

                  <div style="line-height: 23.7999992370605px;"><span

                      style="font-family: 'Microsoft Yahei';">Postal code: 200240</span></div>

                </div>

                <div><br>

                </div>

                <div>Yang Li  +86 188 1821 2371</div>

                <div><span style="line-height: 23.7999992370605px;">Shanghai

                    Jiao Tong University</span></div>

                <div>School of Electronic Information and Electrical

                  Engineering F1203026</div>

                <div>800 Dongchuan Road, Minhang District, Shanghai

                  200240</div>

                <div><br>

                </div>

                <div><br>

                </div>

                <div> </div>

              </div>

              <br>

              At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" &lt;<a

                href="mailto:scikit-learn@python.org"

                moz-do-not-send="true">scikit-learn@python.org</a>&gt;

              wrote:<br>

              <blockquote id="isReplyContent" style="PADDING-LEFT: 1ex;

                MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">

                <div dir="ltr">

                  <div>

                    <div>

                      <div>Dear Yang Li,<br>

                        <br>

                        > Neither the classification tree nor the

                        regression tree supports categorical features.

                        That means the decision tree models can only

                        accept continuous features. <br>

                        <br>

                      </div>

                      Consider either manually encoding your categories

                      in bitstrings (e.g., "Facebook" = 001, "Twitter" =

                      010, "Google" = 100), or using OneHotEncoder to do

                      the same thing for you automatically.<br>
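                      <div>A minimal sketch of the manual bitstring encoding,
                        in plain Python. The helper name one_hot_bitstrings
                        and the sample rows are hypothetical; it assigns bits
                        in first-seen order, which reproduces the codes in
                        the example above, and it is not how OneHotEncoder is
                        implemented.</div>
                      <pre>
```python
def one_hot_bitstrings(categories):
    """Map each distinct category to a bitstring containing exactly one '1'."""
    # dict.fromkeys keeps the distinct values in first-seen order.
    levels = list(dict.fromkeys(categories))
    width = len(levels)
    return {
        level: format(2 ** index, "0{}b".format(width))
        for index, level in enumerate(levels)
    }


encoding = one_hot_bitstrings(["Facebook", "Twitter", "Google", "Twitter"])
print(encoding)

# Each bit becomes one 0/1 column that a tree can split on.
rows = [[int(bit) for bit in encoding[site]] for site in ["Google", "Facebook"]]
print(rows)
```
</pre>
                      <div>Each category then gets its own 0/1 column, so a
                        threshold split on one column separates that category
                        from all the others.</div>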

                      <br>

                    </div>

                    Cheers,<br>

                  </div>

                  J.B.<br>

                </div>

              </blockquote>

            </div>

            <br>

            <br>

            </div>

        </blockquote>

        <blockquote type="cite">

          <div><span>_______________________________________________</span><br>

            <span>scikit-learn mailing list</span><br>

            <span><a href="mailto:scikit-learn@python.org"

                moz-do-not-send="true">scikit-learn@python.org</a></span><br>

            <span><a

                href="https://mail.python.org/mailman/listinfo/scikit-learn"

                moz-do-not-send="true">https://mail.python.org/mailman/listinfo/scikit-learn</a></span><br>

          </div>

        </blockquote>

      </div>

      <br>


    </blockquote>

    <br>

  </body>

</html>