[scikit-learn] Classifiers for dataset with categorical features
Sebastian Raschka
se.raschka at gmail.com
Fri Jul 21 19:09:03 EDT 2017
Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot though.). Also, the name "Learning Classifier Systems" is also a bit weird I'd must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;)
> On Jul 21, 2017, at 3:01 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:
>
> +1
> LCS and its many many variants seem very practical and adaptable. I'm
> not sure why they haven't gotten traction.
> Overshadowed by GBM & random forests?
>
>
> On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka
> <se.raschka at gmail.com> wrote:
>> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year).
>> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising.
>>
>> Best,
>> Sebastian
>>
>>
>>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.markely at gmail.com> wrote:
>>>
>>> Thank you, Jacob. Appreciate it.
>>>
>>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc.
>>>
>>> Thanks,
>>> Raga
>>>
>>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreiber91 at gmail.com> wrote:
>>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately.
>>>
>>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.markely at gmail.com> wrote:
>>> Hello,
>>>
>>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc).
>>>
>>> If you could provide me some references (papers, books, website, etc), that would be great.
>>>
>>> Thank you very much!
>>> Raga
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
More information about the scikit-learn
mailing list