[scikit-learn] A necessary feature for Decision trees
Andreas Mueller
t3kcit at gmail.com
Thu Jan 4 13:45:17 EST 2018
Your contribution would be very welcome; I think the current work has
stalled.
On 01/04/2018 10:02 AM, Julio Antonio Soto de Vicente wrote:
> Hi Yang Li,
>
> I have to agree with you. Bitsets and/or one-hot encoding are just
> hacks that should not be necessary for decision tree learners.
>
> There is some WIP on an implementation for natural handling of
> categorical features in trees: please take a look at
> https://github.com/scikit-learn/scikit-learn/pull/4899
>
> Cheers!
>
> --
> Julio
>
> On 4 Jan 2018, at 9:06, 李扬 <sky188133882 at 163.com
> <mailto:sky188133882 at 163.com>> wrote:
>
>> Dear J.B.,
>>
>> Thanks for your advice!
>>
>> Yeah, I have considered using a bitstring or a sequence number, but the
>> problem is the algorithm, not the representation of the categorical data.
>> Take the regression tree as an example: the algorithm in sklearn finds
>> a split value for the feature, and chooses the best split by computing
>> the minimal impurity of the child nodes.
>> However, finding a split value for a categorical feature is not very
>> meaningful even if you represent it as a continuous value, and the
>> split result partially depends on how you permute the values of the
>> categorical feature, which is not very convincing.
>> Instead, in the CART algorithm, *you should separate each category in
>> the feature from the others and compute the impurity of the two sets.
>> Then find the best separation strategy with the minimal impurity.*
>> Obviously, this separation process cannot be carried out by the current
>> algorithm, which simply uses the split method for continuous values.
>>
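The separation described above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions, not
scikit-learn's implementation: the helper names `variance` and
`best_one_vs_rest_split` and the toy data are made up for this example,
and variance is assumed as the regression impurity.

```python
from collections import defaultdict


def variance(values):
    """Population variance of a list of numbers; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_one_vs_rest_split(categories, targets):
    """Try each category against all others and return the
    (category, weighted_impurity) pair with the lowest impurity,
    or None if fewer than two categories are present."""
    n = len(targets)
    by_category = defaultdict(list)
    for c, y in zip(categories, targets):
        by_category[c].append(y)

    best = None
    for category in sorted(by_category):
        left = by_category[category]
        right = [y for c, y in zip(categories, targets) if c != category]
        if not right:
            continue  # a single category cannot be split
        # Impurity of the two child sets, weighted by their sizes.
        impurity = (len(left) * variance(left)
                    + len(right) * variance(right)) / n
        if best is None or impurity < best[1]:
            best = (category, impurity)
    return best


# Hypothetical toy data: one categorical feature, one regression target.
categories = ["Facebook", "Twitter", "Google", "Twitter", "Facebook", "Google"]
targets = [1.0, 5.0, 1.2, 5.2, 0.8, 1.0]
split = best_one_vs_rest_split(categories, targets)
print(split)
```

On this toy data the chosen split isolates "Twitter" from the other two
categories, and the result does not depend on any numeric ordering
assigned to the categories.
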
>> One more possible shortcoming is that the categorical feature can't be
>> properly visualized. When drawing the tree graph, it's hard to get
>> information from a categorical feature node if you simply split it
>> like a continuous value.
>>
>> Thank you for your time!
>> Best wishes.
>>
>>
>>
>>
>> --
>> Best regards!
>>
>> Yang Li +86 188 1821 2371
>> Shanghai Jiao Tong University
>> School of Electronic, Information and Electrical Engineering F1203026
>> 800 Dongchuan Road, Minhang District, Shanghai 200240
>>
>>
>>
>> At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn"
>> <scikit-learn at python.org <mailto:scikit-learn at python.org>> wrote:
>>
>> Dear Yang Li,
>>
>> > Neither the classificationTree nor the regressionTree supports
>> categorical features. That means the decision tree models can only
>> accept continuous features.
>>
>> Consider either manually encoding your categories in bitstrings
>> (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or
>> using OneHotEncoder to do the same thing for you automatically.
>>
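For illustration, here is a minimal sketch of the manual encoding
suggested above, in plain Python. The helper `one_hot_encode` is
hypothetical and written only for this example; scikit-learn's
OneHotEncoder produces an equivalent encoding automatically. The column
order (alphabetical here) is an arbitrary choice, so the bit patterns
differ from the ones in the example above.

```python
def one_hot_encode(values):
    """Map each value to a 0/1 row with a single 1 in the column of its
    category. Returns (categories, rows); columns follow sorted order."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sample column with three categories.
sites = ["Facebook", "Twitter", "Google", "Twitter"]
categories, encoded = one_hot_encode(sites)
for site, row in zip(sites, encoded):
    print(site, row)
```

Each resulting column can then be passed to the tree as an ordinary
numeric feature.
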
>> Cheers,
>> J.B.
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org <mailto:scikit-learn at python.org>
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>