<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Your contribution would be very welcome; I think the current work
has stalled.<br>
<br>
<br>
<div class="moz-cite-prefix">On 01/04/2018 10:02 AM, Julio Antonio
Soto de Vicente wrote:<br>
</div>
<blockquote type="cite"
cite="mid:1C0DFDCA-503B-4C96-9358-25FCC4970457@esbet.es">
Hi Yang Li,
<div><br>
</div>
<div>I have to agree with you. Bitsets and/or one-hot encoding are
just hacks that should not be necessary for decision tree
learners.</div>
<div><br>
</div>
<div>There is some WIP on an implementation for natural handling
of categorical features in trees: please take a look at <a
href="https://github.com/scikit-learn/scikit-learn/pull/4899"
moz-do-not-send="true">https://github.com/scikit-learn/scikit-learn/pull/4899</a><br>
<br>
Cheers!<br>
<br>
<div>--
<div>Julio</div>
</div>
<div><br>
On 4 Jan 2018, at 9:06, 李扬 <<a
href="mailto:sky188133882@163.com" moz-do-not-send="true">sky188133882@163.com</a>>
wrote:<br>
<br>
</div>
<blockquote type="cite">
<div>
<div
style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial">
<div>Dear J.B.,</div>
<div><br>
</div>
<div>Thanks for your advice!</div>
<div><br>
</div>
<div>Yes, I have considered using a bitstring or a sequence
number, but the problem is the algorithm, not the
representation of the categorical data.</div>
<div>Take the regression tree as an example: the algorithm
in sklearn finds a split value for the feature and chooses
the best split by minimizing the impurity of the
child nodes.</div>
<div>However, finding a split value for a categorical feature is
not very meaningful even if you represent it as a
continuous value, and the resulting split partially
depends on how you order the values of the categorical
feature, which is not very convincing.</div>
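<div><br>
</div>
<div>To make the dependence on ordering concrete, here is a minimal
pure-Python sketch (not scikit-learn's implementation). It assumes
variance as the impurity measure and a made-up toy dataset with three
categories; the helper names variance, best_threshold_impurity and
impurity_for_order are hypothetical. It encodes the categories as
integers in every possible order and searches threshold splits on the
codes:</div>
<pre>
```python
from itertools import permutations


def variance(values):
    """Population variance, used here as the regression impurity (an assumption)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold_impurity(codes, targets):
    """Lowest weighted child variance over all threshold splits on the integer codes."""
    n = len(targets)
    best = float("inf")
    # Skip the largest code: using it as a threshold would leave the right child empty.
    for t in sorted(set(codes))[:-1]:
        left = [y for c, y in zip(codes, targets) if t >= c]
        right = [y for c, y in zip(codes, targets) if c > t]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = min(best, weighted)
    return best


# Toy data (made up): category "B" has clearly different targets from "A" and "C".
categories = ["A", "A", "B", "B", "C", "C"]
targets = [1.0, 1.0, 10.0, 10.0, 2.0, 2.0]


def impurity_for_order(order):
    """Encode each category by its position in `order`, then search thresholds."""
    code_of = {cat: i for i, cat in enumerate(order)}
    codes = [code_of[c] for c in categories]
    return best_threshold_impurity(codes, targets)


for order in permutations("ABC"):
    print(order, round(impurity_for_order(order), 3))
```
</pre>
<div>On this toy data, the orders that place "B" in the middle can
never isolate it with a single threshold, so their best impurity is
much higher than for the orders that place "B" at either end.</div>
<div><br>
</div>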
<div>Instead, in the CART algorithm, <b>you should separate
each category of the feature from the others and compute
the impurity of the two resulting sets, then choose the
separation with the minimal impurity.</b></div>
<div>Obviously, this separation cannot be performed
by the current algorithm, which simply applies the split method
for continuous values.</div>
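<div><br>
</div>
<div>The one-vs-rest separation described above could be sketched
roughly as follows. This is only an illustration in plain Python, not
scikit-learn code: it assumes variance as the impurity measure, the
helper names variance and best_one_vs_rest_split are hypothetical, and
the toy data is made up.</div>
<pre>
```python
def variance(values):
    """Population variance, used here as the regression impurity (an assumption)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_one_vs_rest_split(categories, targets):
    """Return (category, weighted child impurity) for the best 'category vs. rest' split."""
    n = len(targets)
    best_cat, best_impurity = None, float("inf")
    for cat in sorted(set(categories)):
        inside = [y for c, y in zip(categories, targets) if c == cat]
        outside = [y for c, y in zip(categories, targets) if c != cat]
        if not outside:
            # Only one category is present, so there is nothing to separate.
            continue
        impurity = (len(inside) * variance(inside)
                    + len(outside) * variance(outside)) / n
        if best_impurity > impurity:
            best_cat, best_impurity = cat, impurity
    return best_cat, best_impurity


# Toy data (made up): "Twitter" samples have clearly different targets.
categories = ["Facebook", "Facebook", "Twitter", "Twitter", "Google", "Google"]
targets = [1.0, 1.0, 10.0, 10.0, 2.0, 2.0]

best_cat, best_impurity = best_one_vs_rest_split(categories, targets)
print(best_cat, round(best_impurity, 3))
```
</pre>
<div>On this toy data the sketch separates "Twitter" from the other
two categories, and the result does not depend on any ordering of the
category labels. It only tries one category against all the others;
grouping several categories on one side would need a wider search.</div>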
<div><br>
</div>
<div>One more possible shortcoming is that a categorical
feature can't be properly visualized: when drawing the
tree graph, it's hard to get information from a
categorical feature node when it is simply split as a continuous value.</div>
<div><br>
</div>
<div>Thank you for your time!</div>
<div>Best wishes.</div>
<br>
<br>
<br>
<br>
<div style="position:relative;zoom:1">--<br>
<div>
<div style="line-height: 23.7999992370605px;"><span
style="line-height: 23.8px;">Best wishes!</span></div>
<div style="line-height: 23.7999992370605px;"><span
style="line-height: 23.8px;"><br>
</span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-size: 18px;"><b><br>
</b></span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-size: 18px; font-family: 'Microsoft
Yahei';">李扬 </span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-family: 'Microsoft Yahei';">上海交通大学 <span
style="font-family: 'Microsoft Yahei';
line-height: 23.7999992370605px;">电子信息 与 电气工程 学院
</span></span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-family: 'Microsoft Yahei';">电话:18818212371</span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-family: 'Microsoft Yahei';">地址:上海市闵行区东川路800号</span></div>
<div style="line-height: 23.7999992370605px;"><span
style="font-family: 'Microsoft Yahei';">邮编:200240</span></div>
</div>
<div><br>
</div>
<div>Yang Li +86 188 1821 2371</div>
<div><span style="line-height: 23.7999992370605px;">Shanghai
Jiao Tong University</span></div>
<div>School of Electronic,Information and Electrical
Engineering F1203026</div>
<div>800 Dongchuan Road, Minhang District, Shanghai
200240</div>
<div><br>
</div>
<div><br>
</div>
<div> </div>
</div>
<br>
At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <<a
href="mailto:scikit-learn@python.org"
moz-do-not-send="true">scikit-learn@python.org</a>>
wrote:<br>
<blockquote id="isReplyContent" style="PADDING-LEFT: 1ex;
MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div dir="ltr">
<div>
<div>
<div>Dear Yang Li,<br>
<br>
> Neither the classificationTree nor the
regressionTree supports categorical feature.
That means the Decision trees model can only
accept continuous feature. <br>
<br>
</div>
Consider either manually encoding your categories
in bitstrings (e.g., "Facebook" = 001, "Twitter" =
010, "Google" = 100), or using OneHotEncoder to do
the same thing for you automatically.<br>
<br>
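A minimal sketch of the manual encoding, in plain Python, might look
like this. The helper name one_hot_bitstrings is hypothetical, and the
bit assigned to each category simply follows the order in which the
categories are first seen:<br>
<pre>
```python
def one_hot_bitstrings(categories):
    """Map each distinct category to a bitstring containing exactly one '1'."""
    distinct = list(dict.fromkeys(categories))  # drop duplicates, keep first-seen order
    width = len(distinct)
    # The i-th category gets bit i set, zero-padded to one digit per category.
    return {cat: format(2 ** i, f"0{width}b") for i, cat in enumerate(distinct)}


codes = one_hot_bitstrings(["Facebook", "Twitter", "Google", "Twitter"])
print(codes)  # {'Facebook': '001', 'Twitter': '010', 'Google': '100'}

# Expand the bitstrings into 0/1 columns that a tree learner can consume.
rows = [[int(bit) for bit in codes[c]] for c in ["Google", "Facebook"]]
print(rows)   # [[1, 0, 0], [0, 0, 1]]
```
</pre>
Each bit then becomes its own numeric column in the feature matrix.<br>
<br>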
</div>
Cheers,<br>
</div>
J.B.<br>
</div>
</blockquote>
</div>
<br>
<br>
</div>
</blockquote>
<blockquote type="cite">
<div><span>_______________________________________________</span><br>
<span>scikit-learn mailing list</span><br>
<span><a href="mailto:scikit-learn@python.org"
moz-do-not-send="true">scikit-learn@python.org</a></span><br>
<span><a
href="https://mail.python.org/mailman/listinfo/scikit-learn"
moz-do-not-send="true">https://mail.python.org/mailman/listinfo/scikit-learn</a></span><br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>