[scikit-learn] A necessary feature for Decision trees

李扬 sky188133882 at 163.com
Thu Jan 4 03:06:22 EST 2018


Dear J.B.,


Thanks for your advice!


Yes, I have considered using a bitstring or a sequence number, but the problem is the algorithm, not the representation of the categorical data.
Take the regression tree as an example: the algorithm in sklearn searches for a split value of the feature and picks the best split by minimizing the impurity of the child nodes.
However, finding a split value for a categorical feature is not very meaningful even if you represent it as a continuous value, and the resulting split partly depends on how you order the values of the categorical feature, which is not very convincing.
Instead, in the CART algorithm, you should separate each category of the feature from the others and compute the impurity of the two resulting sets, then choose the separation with the minimal impurity.
Obviously, this separation cannot be carried out by the current algorithm, which simply applies the split method for continuous values.
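
To make the difference concrete, here is a minimal pure-Python sketch of the two strategies for a regression target. It is not scikit-learn's implementation. The function names, the toy data, and the two integer codings are all hypothetical, and variance is assumed as the impurity measure.

```python
from statistics import pvariance


def weighted_variance(left, right):
    """Size-weighted variance of two child nodes (regression impurity)."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def best_one_vs_rest_split(categories, y):
    """Separate each category from all others; return (category, impurity)."""
    best = None
    for c in sorted(set(categories)):
        inside = [v for cat, v in zip(categories, y) if cat == c]
        outside = [v for cat, v in zip(categories, y) if cat != c]
        if not outside:  # only one category present, nothing to split
            continue
        impurity = weighted_variance(inside, outside)
        if best is None or impurity < best[1]:
            best = (c, impurity)
    return best


def best_threshold_split(codes, y):
    """Best 'code <= t' split, i.e. the feature is treated as continuous."""
    best = None
    for t in sorted(set(codes))[:-1]:
        left = [v for code, v in zip(codes, y) if code <= t]
        right = [v for code, v in zip(codes, y) if code > t]
        impurity = weighted_variance(left, right)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best


# Hypothetical toy data: only "Twitter" has a clearly different target.
categories = ["Facebook", "Twitter", "Google", "Facebook", "Twitter", "Google"]
y = [1.0, 10.0, 1.2, 0.8, 10.4, 1.0]

one_vs_rest = best_one_vs_rest_split(categories, y)

# Two arbitrary integer codings of the same three categories.
coding_a = {"Facebook": 0, "Twitter": 1, "Google": 2}  # Twitter in the middle
coding_b = {"Facebook": 0, "Google": 1, "Twitter": 2}  # Twitter at the end
split_a = best_threshold_split([coding_a[c] for c in categories], y)
split_b = best_threshold_split([coding_b[c] for c in categories], y)

print("one-vs-rest:", one_vs_rest)
print("threshold, coding A:", split_a)
print("threshold, coding B:", split_b)
```

With coding A, a single threshold cannot isolate "Twitter" because its code lies between the other two, so the best threshold split is much worse than with coding B. The one-vs-rest search does not depend on any ordering of the categories.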


One more possible shortcoming is that a categorical feature can't be properly visualized. When drawing the tree graph, it is hard to get information from a categorical feature node if it has simply been split on a numeric value.
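
A small sketch of what I mean, assuming a recent scikit-learn release that provides `sklearn.tree.export_text` (it is not available in older versions). The feature name `source`, the integer coding, and the data are hypothetical.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data; the integer codes are assigned arbitrarily.
codes = {"Facebook": 0, "Google": 1, "Twitter": 2}
sources = ["Facebook", "Twitter", "Google", "Facebook", "Twitter", "Google"]
X = [[codes[s]] for s in sources]
y = [1.0, 10.0, 1.2, 0.8, 10.4, 1.0]

# A depth-1 tree is enough to show how the root node is rendered.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Text rendering of the fitted tree; the rule is a numeric threshold
# on the code, and the category names do not appear anywhere.
report = export_text(tree, feature_names=["source"])
print(report)
```

The printed rule is a comparison of `source` against a number, so the reader has to look up the coding by hand to learn which categories went to which branch.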


Thank you for your time!
Best wishes.





--

Best regards!


Yang Li  +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic, Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240




 

At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <scikit-learn at python.org> wrote:

Dear Yang Li,

> Neither the classificationTree nor the regressionTree supports categorical features. That means the decision tree models can only accept continuous features.


Consider either manually encoding your categories in bitstrings (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder to do the same thing for you automatically.
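
As a rough sketch of the OneHotEncoder route, assuming a scikit-learn version whose `OneHotEncoder` accepts string categories directly (0.20 or later, as far as I know). The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one categorical column and a numeric target.
X = np.array([["Facebook"], ["Twitter"], ["Google"],
              ["Facebook"], ["Twitter"], ["Google"]])
y = np.array([1.0, 10.0, 1.2, 0.8, 10.4, 1.0])

# One indicator column per category; fit_transform returns a sparse
# matrix by default, so convert it to a dense array for inspection.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X).toarray()

tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X_encoded, y)

# Map the column used at the root back to its category name.
split_column = tree.tree_.feature[0]
split_category = encoder.categories_[0][split_column]
print("root split isolates:", split_category)
```

Each split on an indicator column separates one category from all the others, so the result does not depend on any ordering of the categories.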


Cheers,

J.B.