[scikit-learn] Feature engineering functionality - new package

Sole Galli solegalli1 at gmail.com
Wed Apr 10 13:23:03 EDT 2019


>
> Dear Scikit-Learn team,
>
> Feature engineering is a major task that precedes building machine
> learning models. It involves imputation of missing values, encoding of
> categorical variables, discretisation, variable transformation, etc.
>
> Sklearn includes some functionality for feature engineering, which is
> useful, but it has a few limitations:
>
> 1) It does not allow for feature specification: it applies the same
> process to all variables, as SimpleImputer does, for example. Typically,
> we want to impute different columns with different values.
> 2) It does not capture information from the training set, that is, it
> does not learn, and therefore cannot carry the values learnt from the
> train set over to unseen data.
>
> The two limitations above apply to all the feature transformers in
> sklearn, I believe.
>
> Therefore, if these transformers are used as part of a pipeline, we could
> end up applying different transformations to the train and test sets,
> depending on the characteristics of the datasets. For business purposes,
> this is not desirable.
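
A minimal sketch of the behaviour described above, for illustration only:
a transformer that takes a per-column specification, learns one fill value
per column from the train set in fit, and reuses those stored values on
unseen data in transform. The class name PerColumnImputer, its strategies
parameter and the toy data are hypothetical. This is not the feature-engine
API, and it only borrows the fit/transform convention without subclassing
any sklearn base class.

```python
import pandas as pd


class PerColumnImputer:
    """Hypothetical transformer: one learnt fill value per named column."""

    def __init__(self, strategies):
        # strategies maps column name -> "mean", "median" or "most_frequent"
        self.strategies = strategies

    def fit(self, X, y=None):
        # Learn the fill values from the training data only.
        self.fill_values_ = {}
        for col, strategy in self.strategies.items():
            if strategy == "mean":
                value = X[col].mean()
            elif strategy == "median":
                value = X[col].median()
            elif strategy == "most_frequent":
                value = X[col].mode().iloc[0]
            else:
                raise ValueError(f"unknown strategy: {strategy!r}")
            self.fill_values_[col] = value
        return self

    def transform(self, X):
        # Reuse the values stored in fit; nothing is recomputed from X.
        return X.fillna(value=self.fill_values_)


# Toy data (assumed for the example).
train = pd.DataFrame(
    {"age": [20.0, 30.0, None, 40.0], "city": ["A", "B", "B", None]}
)
test = pd.DataFrame({"age": [None, 100.0], "city": [None, "C"]})

imputer = PerColumnImputer({"age": "median", "city": "most_frequent"})
imputer.fit(train)
test_imputed = imputer.transform(test)
```

Because transform only reads fill_values_, the test set is filled with the
train-set statistics, whatever its own distribution looks like.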
>
> I think that building transformers that learn from the train set would be
> of much use for the community.
>
> To this end, I built a Python package called feature engine
> <https://pypi.org/project/feature-engine/> which extends the sklearn API
> with additional feature engineering techniques, and with functionality
> that allows the transformers to learn from data and store the learnt
> parameters.
>
> The techniques included have been used worldwide, both in business and in
> data competitions, and reported in KDD reports and other articles. I also
> cover them in a Udemy course
> <https://www.udemy.com/feature-engineering-for-machine-learning>, which
> has enrolled several thousand students.
>
> The package capitalises on the use of pandas to capture the features, but
> I am confident that the column names could be captured and the DataFrame
> converted to a numpy array to comply with sklearn requirements.
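
A rough sketch of the conversion mentioned above, under the assumption that
"capturing the column names" means storing them alongside the array. The
variable names and the toy DataFrame are made up for the example and do not
come from the package.

```python
import pandas as pd

# Toy data (assumed for the example).
df = pd.DataFrame({"age": [20.0, 30.0], "income": [1000.0, 2500.0]})

# Remember the column names before dropping down to a plain array.
feature_names = list(df.columns)
X = df.to_numpy()

# The stored names let us address a feature by name inside the array ...
income_idx = feature_names.index("income")
income = X[:, income_idx]

# ... and rebuild a labelled DataFrame when one is needed again.
restored = pd.DataFrame(X, columns=feature_names)
```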
>
> I wondered whether it would be of interest to include the functionality of
> this package within sklearn?
> If you would consider extending the sklearn API to include these
> transformers, I would be happy to help.
>
> Alternatively, would you consider adding the package to the section of
> your website where you mention the libraries that extend sklearn
> functionality?
>
> All feedback is welcome.
>
> Many thanks, and I look forward to hearing from you.
>
> Thank you so much for such an awesome contribution through the sklearn API.
>
> Kind regards
>
> Sole
>
>