[scikit-learn] Announcing skrub: Prepping tables for machine learning

Fernando Marcos Wittmann fernando.wittmann at gmail.com
Mon Dec 18 19:18:03 EST 2023


Very strong baseline indeed. Did a quick check with the Ames housing
dataset:
https://colab.research.google.com/drive/1RVVl_R5X3YYC7kj-B9uI5Fq7-SCYhYnD?usp=sharing

Thanks all for the contribution!

On Mon, Dec 18, 2023 at 2:49 PM Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> Hi everyone,
>
> We are very happy to announce the first release of a new package called
> "skrub". Its goal is to facilitate data preparation from tables to machine
> learning with an API similar to that of scikit-learn.
> https://skrub-data.org
>
> The most useful tool in the short term is the "TableVectorizer", which
> applies a bunch of heuristics to turn a complex table into a good data
> representation for learning (for instance encoding dates, or strings).
> Combined with scikit-learn HistGradientBoosting, it gives a strong baseline
> for most tabular learning settings without data massaging:
>
> from sklearn.ensemble import HistGradientBoostingRegressor
> from sklearn.pipeline import make_pipeline
> from skrub import TableVectorizer
>
> pipeline = make_pipeline(TableVectorizer(),
>                          HistGradientBoostingRegressor())
> pipeline.fit(X, y)
>
>
> In the longer term, skrub will enable assembling full data processing
> pipelines across multiple tables that can be cross-validated with
> scikit-learn and one day put in production: joining, aggregation, and
> transformation, to build models directly from the original tables and
> databases.
>
> One example of such a pipeline can be seen here:
>
> https://skrub-data.org/stable/auto_examples/08_join_aggregation.html#chaining-everything-together-in-a-pipeline
>
> But there is a lot that remains to be done, and the questions are quite
> open.
>
> In my eyes, the dream is to bridge scikit-learn's API, which separates
> fit/transform (because it helps make robust and valid predictive
> pipelines), with dataframe/database operations. The goal is not to provide
> something as flexible as SQL or pandas, but to cover the most frequent
> use cases in machine learning, as explained here:
> https://skrub-data.org/stable/vision.html
>
> Of course, skrub will be developed in the open, with an eye to quality,
> staying as lightweight as possible while still providing powerful tools. I
> hope that many will join this adventure!
>
> Cheers,
>
> Gaël
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
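
For readers less familiar with the fit/transform separation mentioned in the
quoted message, here is a minimal pure-Python sketch of the idea. It uses
neither skrub nor scikit-learn: the class name CategoryEncoder and the toy
data are hypothetical, made up only to mirror the convention. The point it
tries to show is that whatever is learned from the data is learned in fit,
on the training rows only, and then applied unchanged in transform.

```python
class CategoryEncoder:
    """Hypothetical toy encoder, not part of skrub or scikit-learn."""

    def fit(self, values):
        # Learn the category -> integer mapping from the training data only.
        self.mapping_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Apply the learned mapping; categories unseen during fit become -1
        # instead of silently changing the encoding.
        return [self.mapping_.get(v, -1) for v in values]


# Toy data (assumption): a single column of city names.
train = ["paris", "london", "paris", "rome"]
test = ["rome", "berlin", "london"]

encoder = CategoryEncoder().fit(train)    # statistics come from train only
train_codes = encoder.transform(train)
test_codes = encoder.transform(test)      # "berlin" was never seen in fit
```

Because the test rows never influence the mapping, the same encoder can be
refit inside each cross-validation fold without leaking information from the
held-out data.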