[scikit-learn] Announcing skrub: Prepping tables for machine learning

Mon Dec 18 12:49:31 EST 2023

Hi everyone,

We are very happy to announce the first release of a new package called "skrub". Its goal is to facilitate data preparation, from tables to machine learning, with an API similar to that of scikit-learn.
https://skrub-data.org

The most useful tool in the short term is the "TableVectorizer", which applies a set of heuristics to turn a complex table into a good data representation for learning (for instance by encoding dates or strings). Combined with scikit-learn's HistGradientBoosting, it gives a strong baseline for most tabular learning settings without data massaging:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(X, y)
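
To give a feel for what "heuristics" means here, below is a minimal pure-Python sketch of per-column encoding: numbers pass through, date-like strings are expanded into numeric parts, and other strings are one-hot encoded. This is not skrub's implementation; the helper names (`_parse_date`, `vectorize_column`), the fixed date format, and the toy table are all assumptions made for illustration.

```python
from datetime import datetime


def _parse_date(value, fmt="%Y-%m-%d"):
    """Return a datetime if `value` matches `fmt`, else None (hypothetical helper)."""
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return None


def vectorize_column(name, values):
    """Map one column to a dict of feature_name -> list of floats (illustrative only)."""
    # Heuristic 1: numeric columns pass through unchanged.
    if all(isinstance(v, (int, float)) for v in values):
        return {name: [float(v) for v in values]}
    # Heuristic 2: if every value parses as a date, expand it into numeric parts.
    dates = [_parse_date(v) for v in values]
    if all(d is not None for d in dates):
        return {
            f"{name}_year": [float(d.year) for d in dates],
            f"{name}_month": [float(d.month) for d in dates],
            f"{name}_day": [float(d.day) for d in dates],
        }
    # Heuristic 3: otherwise treat the column as categorical and one-hot encode it.
    as_text = [str(v) for v in values]
    categories = sorted(set(as_text))
    return {f"{name}_{c}": [1.0 if v == c else 0.0 for v in as_text] for c in categories}


# Toy table (assumed data), stored as column name -> list of values.
table = {
    "hired": ["2023-12-18", "2021-03-05"],
    "city": ["Paris", "Lyon"],
    "salary": [52000, 48000],
}

features = {}
for name, values in table.items():
    features.update(vectorize_column(name, values))
```

Every resulting column is numeric, which is what a downstream learner needs; the real TableVectorizer makes these choices with more elaborate rules.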

In the longer term, skrub will enable assembling full data-processing pipelines across multiple tables that can be cross-validated with scikit-learn and one day put in production: joining, aggregation, and transformation, to build models directly from the original tables and databases.

One example of such a pipeline can be seen here:
https://skrub-data.org/stable/auto_examples/08_join_aggregation.html#chaining-everything-together-in-a-pipeline

But there is a lot that remains to be done, and the questions are quite open.

In my eyes, the dream is to bridge scikit-learn's API, which separates fit and transform (because that helps build robust and valid predictive pipelines), with dataframe and database operations. The goal is not to provide something as flexible as SQL or pandas, but to cover the most frequent use cases in machine learning, as explained here: https://skrub-data.org/stable/vision.html
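
As a rough illustration of why the fit/transform split matters for table operations, here is a minimal pure-Python sketch of an aggregate-then-join step: per-key means are learned once at fit time and then joined onto any later rows at transform time. The class `MeanJoiner`, its parameters, and the toy data are hypothetical and do not reflect skrub's actual API.

```python
from collections import defaultdict


class MeanJoiner:
    """Hypothetical transformer: learn per-key means in fit, join them in transform."""

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def fit(self, rows):
        # Aggregate: compute the mean of `value` for each `key` seen during fit.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for row in rows:
            sums[row[self.key]] += row[self.value]
            counts[row[self.key]] += 1
        self.means_ = {k: sums[k] / counts[k] for k in sums}
        return self

    def transform(self, rows):
        # Join: attach the stored mean; keys unseen at fit time get None.
        out_name = f"{self.value}_mean"
        return [{**row, out_name: self.means_.get(row[self.key])} for row in rows]


# Toy "training" table (assumed data).
orders = [
    {"user": "a", "amount": 10.0},
    {"user": "a", "amount": 30.0},
    {"user": "b", "amount": 5.0},
]
joiner = MeanJoiner(key="user", value="amount").fit(orders)

# New rows are enriched using only what was learned during fit.
joined = joiner.transform([{"user": "a"}, {"user": "c"}])
```

Because the aggregates are frozen at fit time, applying the transformer to held-out rows cannot leak information from them, which is what makes such a step safe to cross-validate.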

Of course, skrub will be developed in the open, with an eye to quality, staying as lightweight as possible while still providing powerful tools. I hope that many will join this adventure!

Cheers,

Gaël