[scikit-learn] Classifiers for dataset with categorical features

Gael Varoquaux gael.varoquaux at normalesup.org
Wed Jul 26 03:02:28 EDT 2017


The right thing to do would probably be to write a scikit-learn-contrib
package for them and see if they gain traction. If they perform well in,
e.g., Kaggle competitions, we will know that we need them in :).

Cheers,

Gaël

On Fri, Jul 21, 2017 at 07:09:03PM -0400, Sebastian Raschka wrote:
> Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot, though.) Also, the name "Learning Classifier Systems" is a bit weird, I must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;)



> > On Jul 21, 2017, at 3:01 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:

> > +1
> > LCS and its many many variants seem very practical and adaptable. I'm
> > not sure why they haven't gotten traction.
> > Overshadowed by GBM & random forests?


> > On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka
> > <se.raschka at gmail.com> wrote:
> >> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like those from a one-hot encoding). I am not really into LCSs, and only know the basics (I read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year).
> >> Also, I once saw an interesting poster on a Set Covering Machine algorithm, which was benchmarked against SVMs, random forests and the like on categorical (genomics) data. It looked promising.

> >> Best,
> >> Sebastian


> >>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.markely at gmail.com> wrote:

> >>> Thank you, Jacob. Appreciate it.

> >>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc.
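
The metrics named above can be computed with sklearn.metrics. The following is a minimal sketch on made-up binary labels; y_true and y_pred are hypothetical and do not come from any dataset discussed in this thread.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and classifier predictions (illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # fraction of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

On imbalanced classes these scores can disagree noticeably, so it usually helps to say which one matters when comparing classifiers.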

> >>> Thanks,
> >>> Raga

> >>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreiber91 at gmail.com> wrote:
> >>> Traditionally, tree-based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean by "perform better", though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately.
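
One way to see the last point is a toy comparison, sketched here under a stated assumption: "treating them as discrete" is read as feeding integer category codes to the estimator as a single ordered numeric column. The data are synthetic, and the example uses plain one-hot columns rather than the native categorical support mentioned in the WIP PR.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical nominal feature with three categories coded 0, 1, 2.
# The label is 1 only for code 1, so the integer ordering carries no meaning.
codes = np.array([0, 1, 2] * 20)
y = (codes == 1).astype(int)

# Option A: integer codes used as one numeric column.
X_int = codes.reshape(-1, 1)
# Option B: one indicator column per category (one-hot).
X_onehot = np.eye(3)[codes]

# A depth-1 tree can make a single threshold split.
stump_int = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_int, y)
stump_onehot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_onehot, y)

acc_int = stump_int.score(X_int, y)
acc_onehot = stump_onehot.score(X_onehot, y)
print("integer codes:", acc_int, "one-hot:", acc_onehot)
```

A single threshold on the integer codes cannot isolate the middle category, while one split on its indicator column can. A deeper tree would recover the same partition from the integer codes, so this only illustrates the depth-limited case.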

> >>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.markely at gmail.com> wrote:
> >>> Hello,

> >>> I am wondering if there are some classifiers that perform better on datasets with categorical features (converted into a sparse input matrix with pd.get_dummies()). The categorical features are nominal (order doesn't matter, e.g. country, occupation, etc.).
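
For reference, the encoding step described in the question looks roughly like this. It is a minimal sketch: the DataFrame, its column names and its values are hypothetical stand-ins for the poster's data.

```python
import pandas as pd

# Hypothetical toy data with two nominal features (illustration only).
df = pd.DataFrame({
    "country": ["FR", "US", "FR", "DE"],
    "occupation": ["engineer", "teacher", "teacher", "engineer"],
})

# One indicator column per (feature, category) pair, named "<feature>_<category>".
X = pd.get_dummies(df, columns=["country", "occupation"])

print(list(X.columns))
print(X.shape)
```

Each row has exactly one active indicator per original feature, so the matrix becomes mostly zeros as the number of categories grows. pd.get_dummies also accepts sparse=True to store the indicator columns in a sparse format.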

> >>> If you could point me to some references (papers, books, websites, etc.), that would be great.

> >>> Thank you very much!
> >>> Raga



> >>> _______________________________________________
> >>> scikit-learn mailing list
> >>> scikit-learn at python.org
> >>> https://mail.python.org/mailman/listinfo/scikit-learn




-- 
    Gael Varoquaux
    Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

