[scikit-learn] Feature engineering functionality - new package

Sole Galli solegalli1 at gmail.com
Tue Apr 23 21:36:16 EDT 2019


Hi Andreas and team,

Thank you very much for your reply. This was very helpful. Happy to hear
that functionality similar to CountFrequencyCategoricalEncoder,
MeanCategoricalEncoder and RareLabelCategoricalEncoder is on the agenda.
The last one, grouping of rare labels, would be useful for both the
OneHotEncoder and the OrdinalEncoder, as per a previous thread.
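
For illustration, here is a minimal sketch of the grouping I have in mind
(the 5% threshold and the "Rare" placeholder are arbitrary choices of mine,
not any particular implementation):

import pandas as pd

def group_rare_labels(series, tol=0.05, rare_label="Rare"):
    # replace categories whose relative frequency is below `tol`
    # with a single placeholder label
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < tol].index
    return series.where(~series.isin(rare), rare_label)

s = pd.Series(["a"] * 50 + ["b"] * 45 + ["c"] * 3 + ["d"] * 2)
print(group_rare_labels(s).value_counts())  # a: 50, b: 45, Rare: 5

In a pipeline one would of course learn the frequent categories on the
train set only, and apply the same grouping to unseen data.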

-------------------------
Re: your questions:

Examples of various discretisers can be found in the winning solutions of
the KDD 2009 annual competition articles
<http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf>. See for example:

   - The bullet points on page 26, which include the use of decision trees
   to create bins.
   - The summary of methods employed on page 14: "Discretization was the
   second most used preprocessing. Its usefulness for this particular
   dataset is justified by the non-normality of the distribution of the
   variables and the existence of extreme values. The simple bining [sic]
   used by the winners of the slow track proved to be efficient."
   - The peculiar binning described in section 2.2 on page 36.
   - I also use discretisers at work, inspired by the KDD articles; see for
   example my blog post at the peer-to-peer company
   <https://blog.zopa.com/2017/07/20/tips-honing-logistic-regression-models/>,
   which I would argue attests to a successful implementation :p
   - Equal width and equal frequency discretisers are discussed in this
   master's thesis
   <https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf>
   (see the sketch after this list).
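
As an aside, scikit-learn's KBinsDiscretizer (added in 0.20) already covers
the equal width and equal frequency cases; a minimal sketch of the
difference on a skewed variable:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.lognormal(size=(1000, 1))  # skewed variable

# equal width: bin edges evenly spaced between min and max
equal_width = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
# equal frequency: bin edges at the quantiles, ~200 observations per bin
equal_freq = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

print(np.bincount(equal_width.fit_transform(X).ravel().astype(int)))
print(np.bincount(equal_freq.fit_transform(X).ravel().astype(int)))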

Winsorisation, or top coding: we use these all the time in the industry,
usually capping at arbitrary values. Winsorisation using the mean and
standard deviation, or quantiles, is a way of automating the capping. In
theory it should boost the performance of linear models; I have tried it
myself on a couple of toy datasets from Kaggle. I don't have a good article
to point you to at the moment. There are a few that discuss top coding, and
also the effect of outliers on neural networks, but I am not too sure how
widely accepted they are.
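
To make the idea concrete, a minimal sketch of capping values learnt from
the train set (the 3 standard deviations below are one common, arbitrary
choice, not any particular package's API):

import numpy as np
import pandas as pd

def fit_caps(train, k=3.0):
    # learn the capping values on the train set only: mean +/- k * std
    mu, sigma = train.mean(), train.std()
    return mu - k * sigma, mu + k * sigma

def winsorise(series, lower, upper):
    # apply the learnt caps to any dataset, train or unseen
    return series.clip(lower=lower, upper=upper)

train = pd.Series(np.random.normal(50, 10, 1000))
lower, upper = fit_caps(train)
train_capped = winsorise(train, lower, upper)
# the same lower/upper values are then perpetuated to the test set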

On WoE, I understand it is common practice in finance. I haven't used it at
work, but I have used it on toy datasets, where it behaves more or less the
same as target mean encoding. The purpose of WoE goes beyond improving
performance, though: it is also a way of "standardising" the variables and
making them interpretable. See for example this summary:
<http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview>
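
A minimal sketch of the computation, assuming the usual definition
WoE = ln( P(category | y=1) / P(category | y=0) ) from that overview (in
practice one adds a small constant so that categories seen in only one
class do not produce infinite values):

import numpy as np
import pandas as pd

def woe_mapping(X, y):
    # learn the weight of evidence per category on the train set
    pos = X[y == 1].value_counts(normalize=True)
    neg = X[y == 0].value_counts(normalize=True)
    return np.log(pos / neg)

X = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series([1, 1, 0, 1, 0, 0])
print(woe_mapping(X, y))  # a: +0.69, b: -0.69

The mapping learnt on the train set is then used to replace the categories
in both train and unseen data.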

I know that sklearn prefers to include widely accepted algorithms, ideally
from well-cited articles. So for winsorisation and WoE I am not quite
answering your questions, I guess. I will keep an eye out in case something
new comes up.

------------------
Re: sharing feature-engine in sklearn contrib.

I would really appreciate it if you could do that. I am planning to expand
the package with other feature engineering techniques, which I think will
be useful for the community, in particular until ColumnTransformer becomes
widely adopted and the other transformers are developed. It would be great
if it could be shared on the contrib page
<https://github.com/scikit-learn-contrib> and also on the related projects
<https://scikit-learn.org/stable/related_projects.html> page.

----------------
Re: the categorical encoding package

I am aware that it exists, although I haven't tried it myself. When it was
presented at my company, the main criticism was that most of the encoders
distort the variables so much that they lose all human interpretability.
So the business prefers not to use these types of encoding, which I think
I kind of agree with.

Thanks again for your time. Let me know if / how I can help, and whether
you would be happy to include feature-engine on the contrib page.

Have a good rest of week

Sole

On Mon, 15 Apr 2019 at 15:56, Andreas Mueller <t3kcit at gmail.com> wrote:

> 1) was indeed a design decision. Your design is certainly an alternative
> design, that might be more convenient in some situations,
> but requires adding this feature to all transformers, which basically just
> adds a bunch of boilerplate code everywhere.
> So you could argue our design decision was more driven by ease of
> maintenance than ease of use.
>
> There might be some transformers in your package that we could add to
> scikit-learn in some form, but several are already available,
> SimpleImputer implements MedianMeanImputer, CategoricalVariableImputer and
> FrequentCategoryImputer.
> We don't currently have RandomSampleImputer and EndTailImputer, I think.
> AddNaNBinaryImputer is "MissingIndicator" in sklearn.
>
> OneHotCategoricalEncoder and OrdinalEncoder exist,
> CountFrequencyCategoricalEncoder and MeanCategoricalEncoder are in the
> works,
> though there are some arguments about the details. These are also in the
> categorical-encoding package:
> http://contrib.scikit-learn.org/categorical-encoding/
>
> RareLabelCategoricalEncoder is something I definitely want in
> OneHotEncoder, not sure if there's a PR yet.
>
> Do you have examples of WoERatioCategoricalEncoder or Windsorizer or any
> of the discretizers actually working well in practice?
> I have not seen them used much, they seemed to be popular in Weka, though.
>
> BoxCoxTransformer is implemented in PowerTransformer, and LogTransformer,
> ReciprocalTransformer and ExponentialTransformer can be
> implemented as FunctionTransformer(np.log), FunctionTransformer(lambda x:
> 1/x) and FunctionTransformer(lambda x: x ** exp) I believe.
>
> It might be interesting to add your package to scikit-learn-contrib:
> https://github.com/scikit-learn-contrib
>
> We are struggling a bit with how to best organize that, though.
>
> Cheers,
> Andy
>
>
> On 4/10/19 2:13 PM, Sole Galli wrote:
>
> Hi Nicolas,
>
> You are right, I am just checking this in the source code.
>
> Sorry for the confusion and thanks for the quick response
>
> Cheers
>
> Sole
>
> On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nicolas at gmail.com> wrote:
>
>> Hi Sole,
>>
>> I'm not sure the 2 limitations you mentioned are correct.
>> 1) in your example, using the ColumnTransformer you can impute different
>> values for different columns.
>> 2) the sklearn transformers do learn on the training set and are able to
>> perpetuate the values learnt from the train set to unseen data.
>>
>> Nicolas
>>
>> On Wed, Apr 10, 2019, 18:25 Sole Galli <solegalli1 at gmail.com> wrote:
>>
>>> Dear Scikit-Learn team,
>>>>
>>>> Feature engineering is a big task ahead of building machine learning
>>>> models. It involves imputation of missing values, encoding of categorical
>>>> variables, discretisation, variable transformation etc.
>>>>
>>>> Sklearn includes some functionality for feature engineering, which is
>>>> useful, but it has a few limitations:
>>>>
>>>> 1) it does not allow for feature specification - it will do the same
>>>> process on all variables, for example SimpleImputer. Typically, we
>>>> want to impute different columns with different values.
>>>> 2) It does not capture information from the training set, that is, it
>>>> does not learn, therefore, it is not able to perpetuate the values learnt
>>>> from the train set, to unseen data.
>>>>
>>>> The 2 limitations above apply to all the feature transformers in
>>>> sklearn, I believe.
>>>>
>>>> Therefore, if these transformers are used as part of a pipeline, we
>>>> could end up doing different transformations to train and test, depending
>>>> on the characteristics of the datasets. For business purposes, this is not
>>>> a desired option.
>>>>
>>>> I think that building transformers that learn from the train set would
>>>> be of much use for the community.
>>>>
>>>> To this end, I built a python package called feature engine
>>>> <https://pypi.org/project/feature-engine/> which expands the
>>>> sklearn-api with additional feature engineering techniques, and the
>>>> functionality that allows the transformer to learn from data and store the
>>>> parameters learnt.
>>>>
>>>> The techniques included have been used worldwide, both in business and
>>>> in data competitions, and reported in kdd reports and other articles. I
>>>> also cover them in an udemy course
>>>> <https://www.udemy.com/feature-engineering-for-machine-learning> which
>>>> has enrolled several thousand students.
>>>>
>>>> The package capitalises on the use of pandas to capture the features,
>>>> but I am confident that the columns names could be captured and the df
>>>> transformed to a numpy array to comply with sklearn requirements.
>>>>
>>>> I wondered whether it would be of interest to include the functionality
>>>> of this package within sklearn?
>>>> If you would consider extending the sklearn api to include these
>>>> transformers, I would be happy to help.
>>>>
>>>> Alternatively, would you consider adding the package to your website,
>>>> where you mention the libraries that extend sklearn functionality?
>>>>
>>>> All feedback is welcome.
>>>>
>>>> Many thanks and I look forward to hearing from you
>>>>
>>>> Thank you so much for such an awesome contribution through the sklearn
>>>> API.
>>>>
>>>> Kind regards
>>>>
>>>> Sole
>>>>