[scikit-learn] Feature engineering functionality - new package

Andreas Mueller t3kcit at gmail.com
Mon Apr 15 10:55:11 EDT 2019


1) was indeed a design decision: per-column behaviour is handled by 
ColumnTransformer rather than built into each transformer. Your design is 
certainly an alternative that might be more convenient in some situations, 
but it requires adding this feature to all transformers, which basically 
just adds a bunch of boilerplate code everywhere.
So you could argue our design decision was driven more by ease of 
maintenance than by ease of use.
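For reference, a minimal sketch of what that looks like with ColumnTransformer 
(the column names and toy data below are made up for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer

    # hypothetical data with missing values in a numeric and a categorical column
    X = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                      "city": ["Paris", np.nan, "Paris"]})

    ct = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["age"]),
        ("cat", SimpleImputer(strategy="most_frequent"), ["city"]),
    ])
    X_imputed = ct.fit_transform(X)  # each column gets its own imputation strategy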

There might be some transformers in your package that we could add to 
scikit-learn in some form, but several are already available:
SimpleImputer covers MedianMeanImputer, CategoricalVariableImputer 
and FrequentCategoryImputer.
We don't currently have RandomSampleImputer or EndTailImputer, I think. 
AddNaNBinaryImputer is "MissingIndicator" in sklearn.
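Roughly, assuming I'm reading your class names correctly, the mapping would 
be something like this (a sketch, not an exact equivalence):

    from sklearn.impute import SimpleImputer, MissingIndicator

    SimpleImputer(strategy="median")         # ~ MedianMeanImputer (median)
    SimpleImputer(strategy="mean")           # ~ MedianMeanImputer (mean)
    SimpleImputer(strategy="most_frequent")  # ~ FrequentCategoryImputer
    SimpleImputer(strategy="constant", fill_value="Missing")  # ~ CategoricalVariableImputer (fill string is just an example)
    MissingIndicator()                       # ~ AddNaNBinaryImputer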

OneHotCategoricalEncoder and OrdinalEncoder exist, and 
CountFrequencyCategoricalEncoder and MeanCategoricalEncoder are in the 
works, though there are some arguments about the details. These are also 
in the categorical-encoding package:
http://contrib.scikit-learn.org/categorical-encoding/
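The ones that do exist follow the usual fit/transform pattern; a small 
sketch with made-up data (handle_unknown="ignore" is one way to deal with 
categories that only appear at transform time):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X_train = np.array([["red"], ["blue"], ["red"]])

    ohe = OneHotEncoder(handle_unknown="ignore").fit(X_train)
    ohe.transform(np.array([["green"]])).toarray()  # unseen category -> all zeros: [[0., 0.]]

    OrdinalEncoder().fit_transform(X_train)         # -> [[1.], [0.], [1.]]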

RareLabelCategoricalEncoder is something I definitely want in 
OneHotEncoder; I'm not sure whether there's a PR yet.

Do you have examples of WoERatioCategoricalEncoder, Windsorizer or any 
of the discretizers actually working well in practice?
I have not seen them used much; they seemed to be popular in Weka, though.

BoxCoxTransformer is implemented in PowerTransformer, and 
LogTransformer, ReciprocalTransformer and ExponentialTransformer can be
implemented as FunctionTransformer(np.log), FunctionTransformer(lambda 
x: 1 / x) and FunctionTransformer(lambda x: x ** exp), I believe.
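That is, something along these lines (a sketch; the exponent in the last 
one is a placeholder you would choose yourself, and Box-Cox requires 
strictly positive inputs):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer, FunctionTransformer

    box_cox = PowerTransformer(method="box-cox")          # ~ BoxCoxTransformer
    log_tf = FunctionTransformer(np.log)                  # ~ LogTransformer
    reciprocal_tf = FunctionTransformer(lambda x: 1 / x)  # ~ ReciprocalTransformer
    exp_tf = FunctionTransformer(lambda x: x ** 0.5)      # ~ ExponentialTransformer, exponent 0.5 chosen as an example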

It might be interesting to add your package to scikit-learn-contrib:
https://github.com/scikit-learn-contrib

We are struggling a bit with how to best organize that, though.

Cheers,
Andy


On 4/10/19 2:13 PM, Sole Galli wrote:
> Hi Nicolas,
>
> You are right, I am just checking this in the source code.
>
> Sorry for the confusion and thanks for the quick response
>
> Cheers
>
> Sole
>
> On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nicolas at gmail.com> wrote:
>
>     Hi Sole,
>
>     I'm not sure the 2 limitations you mentioned are correct.
>     1) in your example, using the ColumnTransformer you can impute
>     different values for different columns.
>     2) the sklearn transformers do learn on the training set and are
>     able to perpetuate the values learnt from the train set to unseen
>     data.
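>
>     To illustrate point 2), a minimal sketch with made-up numbers:
>
>         import numpy as np
>         from sklearn.impute import SimpleImputer
>
>         X_train = np.array([[1.0], [2.0], [np.nan]])  # mean of observed train values = 1.5
>         X_test = np.array([[np.nan], [10.0]])
>
>         imp = SimpleImputer(strategy="mean").fit(X_train)  # learns 1.5 from the train set
>         imp.transform(X_test)  # -> [[1.5], [10.0]]; the test NaN is filled with the train mean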
>
>     Nicolas
>
>     On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com> wrote:
>
>             Dear Scikit-Learn team,
>
>             Feature engineering is a big task that comes before building
>             machine learning models. It involves imputation of missing
>             values, encoding of categorical variables, discretisation,
>             variable transformation, etc.
>
>             Sklearn includes some functionality for feature
>             engineering, which is useful, but it has a few limitations:
>
>             1) it does not allow for feature specification - it will
>             do the same process on all variables, for example
>             SimpleImputer. Typically, we want to impute different
>             columns with different values.
>             2) It does not capture information from the training set;
>             that is, it does not learn, and therefore it is not able to
>             perpetuate the values learnt from the train set to unseen
>             data.
>
>             The 2 limitations above apply to all the feature
>             transformers in sklearn, I believe.
>
>             Therefore, if these transformers are used as part of a
>             pipeline, we could end up applying different transformations
>             to the train and test sets, depending on the characteristics
>             of each dataset. For business purposes, this is not desirable.
>
>             I think that building transformers that learn from the
>             train set would be of much use for the community.
>
>             To this end, I built a Python package called feature-engine
>             <https://pypi.org/project/feature-engine/> which extends the
>             sklearn API with additional feature engineering techniques,
>             plus the functionality that allows the transformers to learn
>             from data and store the learnt parameters.
>
>             The techniques included have been used worldwide, both in
>             business and in data competitions, and reported in KDD
>             reports and other articles. I also cover them in a Udemy
>             course
>             <https://www.udemy.com/feature-engineering-for-machine-learning>
>             which has enrolled several thousand students.
>
>             The package capitalises on pandas to capture the features,
>             but I am confident that the column names could be captured
>             and the dataframe transformed to a numpy array to comply
>             with sklearn requirements.
>
>             I wondered whether it would be of interest to include the
>             functionality of this package within sklearn.
>             If you would consider extending the sklearn API to include
>             these transformers, I would be happy to help.
>
>             Alternatively, would you consider adding the package to
>             your website, where you mention the libraries that extend
>             sklearn functionality?
>
>             All feedback is welcome.
>
>             Many thanks and I look forward to hearing from you
>
>             Thank you so much for such an awesome contribution through
>             the sklearn API.
>
>             Kind regards
>
>             Sole
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



More information about the scikit-learn mailing list