[scikit-learn] A necessary feature for Decision trees

Andreas Mueller t3kcit at gmail.com
Thu Jan 4 13:45:17 EST 2018


Your contribution would be very welcome; I think the current work has 
stalled.


On 01/04/2018 10:02 AM, Julio Antonio Soto de Vicente wrote:
> Hi Yang Li,
>
> I have to agree with you. Bitsets and/or one-hot encoding are just 
> hacks that should not be necessary for decision tree learners.
>
> There is some WIP on an implementation for natural handling of 
> categorical features in trees: please take a look at 
> https://github.com/scikit-learn/scikit-learn/pull/4899
>
> Cheers!
>
> -- 
> Julio
>
> On 4 Jan 2018, at 9:06, 李扬 <sky188133882 at 163.com 
> <mailto:sky188133882 at 163.com>> wrote:
>
>> Dear J.B.,
>>
>> Thanks for your advice!
>>
>> Yes, I have considered using a bitstring or sequence numbers, but the 
>> problem is the algorithm, not the representation of the categorical data.
>> Take the regression tree as an example: the algorithm in sklearn finds 
>> a split value of the feature, and chooses the best split by minimizing 
>> the impurity of the child nodes.
>> However, finding a split value for a categorical feature is not very 
>> meaningful even if you represent it as a continuous value, and the 
>> resulting split partly depends on how you order the values of the 
>> categorical feature, which is not very convincing.
>> Instead, in the CART algorithm, *you should separate each category in 
>> the feature from the others and compute the impurity of the two sets, 
>> then choose the separation with the minimal impurity.*
>> Obviously, this separation cannot be done by the current 
>> algorithm, which simply applies the split method for continuous values.
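
The one-vs-rest separation described above can be sketched in plain Python. This is only an illustration of the idea for a regression target, using variance as the impurity; it is not scikit-learn's implementation, and the names `variance` and `best_one_vs_rest_split` are hypothetical.

```python
def variance(ys):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def best_one_vs_rest_split(categories, targets):
    """Try each category against all the others and return the
    (category, weighted child impurity) pair with the lowest impurity."""
    n = len(targets)
    best = None
    for c in sorted(set(categories)):
        left = [y for x, y in zip(categories, targets) if x == c]
        right = [y for x, y in zip(categories, targets) if x != c]
        if not right:
            # Only one category present: nothing to separate.
            continue
        impurity = (len(left) * variance(left) + len(right) * variance(right)) / n
        if best is None or impurity < best[1]:
            best = (c, impurity)
    return best


# Toy data (made up for illustration): category "b" has a distinct target.
x = ["a", "b", "c", "a", "b", "c"]
y = [1.0, 5.0, 1.0, 1.0, 5.0, 1.0]
best_category, best_impurity = best_one_vs_rest_split(x, y)
print(best_category, best_impurity)
```

On this toy data the sketch isolates "b" in a single split. If the categories were encoded as a = 0, b = 1, c = 2, a single threshold could only produce {a} vs. {b, c} or {a, b} vs. {c}, which is the dependence on value ordering mentioned above.
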
>>
>> One more possible shortcoming is that a categorical feature can't be 
>> properly visualized. When drawing the tree graph, it is hard to get 
>> information from a categorical feature node if you simply split it.
>>
>> Thank you for your time!
>> Best wishes.
>>
>>
>>
>>
>> --
>> With best wishes for the season!
>>
>> Yang Li  +86 188 1821 2371
>> Shanghai Jiao Tong University
>> School of Electronic,Information and Electrical Engineering F1203026
>> 800 Dongchuan Road, Minhang District, Shanghai 200240
>>
>>
>>
>> At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" 
>> <scikit-learn at python.org <mailto:scikit-learn at python.org>> wrote:
>>
>>     Dear Yang Li,
>>
>>     > Neither the classification tree nor the regression tree supports
>>     categorical features. That means the decision tree models can only
>>     accept continuous features.
>>
>>     Consider either manually encoding your categories in bitstrings
>>     (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or
>>     using OneHotEncoder to do the same thing for you automatically.
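
As a rough illustration of the bitstring idea above, here is a pure-Python sketch of one-hot encoding. It is not scikit-learn's OneHotEncoder; the helper name `one_hot` is hypothetical, and the alphabetical column order is an assumption, so the bit positions differ from the example assignment in the message.

```python
def one_hot(values):
    """Map each value to a list with exactly one bit set.

    Columns follow the sorted order of the distinct categories
    (an arbitrary choice made for this sketch).
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


sites = ["Facebook", "Twitter", "Google", "Twitter"]
categories, encoded = one_hot(sites)
print(categories)
for site, row in zip(sites, encoded):
    print(site, row)
```

Each category gets its own 0/1 column, so a tree can then split on one category at a time by thresholding that column.
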
>>
>>     Cheers,
>>     J.B.
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org <mailto:scikit-learn at python.org>
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
