[scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding

Tue Feb 20 14:53:20 EST 2018

Hi Dale,

The two issues you mention are indeed current limitations of sklearn's
API, but we are working on solving them:

1) ColumnTransformer to be able to apply different transformers to
different columns: https://github.com/scikit-learn/scikit-learn/pull/9012/

2) As you mention, there is the CategoricalEncoder, which is exactly meant
to solve that problem. So indeed, with the upcoming release of sklearn,
this will be solved.
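
To make the two points above concrete, here is a minimal sketch of how they
could fit together in one pipeline. It assumes the API as it was eventually
released (ColumnTransformer in sklearn.compose, and string categories handled
by OneHotEncoder rather than a separate CategoricalEncoder), which may differ
from the PR and from master at the time of this mail. The data and column
names are made up for illustration.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numerical column.
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1.0, 2.5, 0.5, 3.0, 2.0, 3.5],
})
y = [0, 1, 0, 1, 1, 1]

# Apply a different transformer to each column, then stack the results.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        ("num", StandardScaler(), ["size"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# The fitted preprocessing step yields 3 one-hot columns plus 1 scaled column.
Xt = model.named_steps["preprocess"].transform(X)

# The whole fitted pipeline can be pickled and reused for later predictions.
restored = pickle.loads(pickle.dumps(model))
```

Because the encoder is fitted inside the pipeline, the same encoding is
reapplied at prediction time, which is what pandas.get_dummies() alone does
not give you.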

Both are already in the works (CategoricalEncoder has been merged into
master, ColumnTransformer is still an open PR), but further contributions
would certainly be welcome! In the first place, testing out this
functionality, seeing how it fits into your workflow and pipelines, and
providing feedback on it is very valuable. Neither has been released yet,
so we can still make changes if necessary (for CategoricalEncoder you can
use sklearn master to test, for the ColumnTransformer you will need to
check out the PR I mentioned above).
Secondly, there are still some open issues to further improve the
CategoricalEncoder, and help with those is also certainly welcome (see e.g.
https://github.com/scikit-learn/scikit-learn/issues/10181,
https://github.com/scikit-learn/scikit-learn/issues/10465, some kind of
'drop_first' parameter, ...)
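
As an illustration of what a 'drop_first' option means, here is a small
sketch. It assumes a later scikit-learn release (0.21 or newer) in which this
is spelled drop="first" on OneHotEncoder; that spelling is not something that
exists on master at the time of this mail, and the example data is made up.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature with three levels.
X = [["red"], ["blue"], ["green"], ["blue"]]

# Full encoding: one output column per category.
full = OneHotEncoder().fit(X)
full_cols = full.transform(X).toarray()

# Reduced encoding: the first category (in sorted order) is dropped,
# so it is represented by a row of all zeros.
reduced = OneHotEncoder(drop="first").fit(X)
reduced_cols = reduced.transform(X).toarray()
```

Dropping one level avoids perfectly collinear columns, which matters mostly
for unregularized linear models.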

Best,
Joris

2018-02-20 19:06 GMT+01:00 Dale Jacques <djacques at uwalumni.com>:

> Hello all,
>
> Long time lurker, first time emailer.
>
> I have two small contributions I would like to propose to the email list.
>
> I was working on a project this weekend that was using both categorical
> and numerical columns to predict a final output. I needed to save my
> transformations to make future predictions and grid search over multiple
> models and parameters, so sklearn pipelines were the obvious answer.  I
> set up a pipeline, grid searched, then pickled the best model to use for
> future predictions.
>
> This worked well, but I ran into two issues.
> *1).  I needed a transformer to select individual columns in my pipeline.*
> I needed to apply unique transformations to each column in my data,
> then recombine with a FeatureUnion.  I realized there is not a supported
> transformer to extract a specific column within pipelines.  See this
> issue here as an example
> <https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns?rq=1>.
> I created a transformation that explicitly extracts columns of interest for
> use in a pipeline with FeatureUnion.  A FunctionTransformer will solve this
> issue, but I feel as if sklearn should directly and explicitly support this
> functionality.  I believe this will make pipelines significantly more
> intuitive and accessible for most users.
>
> *2).  One hot encoding requires arrays that are already integers.*  You
> can find a similar issue here
> <https://stackoverflow.com/questions/40456867/labelbinarizer-for-multiple-columns-in-data-frame>.
> This can be accomplished using pandas.get_dummies() (where the
> transformation cannot be saved to apply to future predictions) or by using
> a scikit-learn LabelBinarizer
> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html>
> transformation.  LabelBinarizer is designed to transform y, and its fit
> and transform methods accept only y rather than X and y, which breaks
> scikit-learn pipelines.  I built a LabelBinarizer transformation that can be used with
> FeatureUnion in pipelines.  This issue may be moot with the new
> CategoricalEncoder
> <http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html>
> that is about to be released.
>
> Does the community believe I should pursue contributing either of these?
>
> --
> Cheers,
>
> DJ
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>