[scikit-learn] ANN Dirty_cat: learning on dirty categories

Tue Nov 20 15:58:18 EST 2018

Hi scikit-learn friends,

As you might have seen on Twitter, my lab -with a few friends- has
embarked on research to ease machine learning on "dirty data". We are
experimenting with new encoding methods for non-curated string categories.
For this, we are developing a small software project called "dirty_cat":
https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas on "dirty categories". It is a
research project, though we still try to do decent software engineering
:). Rather than contributing to existing codebases (such as the great
categorical-encoding project in scikit-learn-contrib), we spun it out
as a separate software project to have the freedom to try out ideas that
we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string
categories, please give it a try. Understanding what works and what does
not is important to know what to consolidate. Hopefully one day we can
develop a tool that is of wide-enough interest that it can go in
scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions of publicly available databases that we
could try it on, we would love to hear from you.

Cheers,

Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or
an engineer, drop me a line.