<div dir="auto"><div>Hi Sole,</div><div dir="auto"><br></div><div dir="auto">I'm not sure the 2 limitations you mentioned are correct.</div><div dir="auto">1) In your example, you can use ColumnTransformer to impute different values for different columns.</div><div dir="auto">2) The sklearn transformers do learn from the training set and are <span style="font-family:sans-serif">able to apply the values learnt from the train set to unseen data.</span><br><br>Nicolas<br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019, 18:25 Sole Galli &lt;<a href="mailto:solegalli1@gmail.com">solegalli1@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Dear Scikit-Learn team,<div><br></div><div>Feature engineering is a big task ahead of building machine learning models. It involves imputation of missing values, encoding of categorical variables, discretisation, variable transformation, etc.</div><div><br></div><div>Sklearn includes some functionality for feature engineering, which is useful, but it has a few limitations:</div><div><br></div><div>1) It does not allow for feature specification: it applies the same process to all variables, for example <span style="background-color:rgb(248,248,248);font-family:monospace;font-size:14.04px;font-weight:700;white-space:nowrap">SimpleImputer</span>. Typically, we want to impute different columns with different values. </div><div>2) It does not capture information from the training set, that is, it does not learn; therefore, it is not able to perpetuate the values learnt from the train set to unseen data. 
</div><div><br></div><div>The 2 limitations above apply to all the feature transformers in sklearn, I believe.</div><div><br></div><div>Therefore, if these transformers are used as part of a pipeline, we could end up applying different transformations to train and test, depending on the characteristics of the datasets. For business purposes, this is not a desired option.</div><div> <br></div><div>I think that building transformers that learn from the train set would be of much use for the community.</div><div><br></div><div>To this end, I built a Python package called <a href="https://pypi.org/project/feature-engine/" target="_blank" rel="noreferrer">feature engine</a> which expands the sklearn API with additional feature engineering techniques, and the functionality that allows the transformer to learn from data and store the parameters learnt.</div><div><br></div><div>The techniques included have been used worldwide, both in business and in data competitions, and reported in KDD reports and other articles. I also cover them in a <a href="https://www.udemy.com/feature-engineering-for-machine-learning" target="_blank" rel="noreferrer">Udemy course</a> which has enrolled several thousand students.</div><div><br></div><div>The package capitalises on the use of pandas to capture the features, but I am confident that the column names could be captured and the DataFrame transformed to a NumPy array to comply with sklearn requirements.</div><div><br></div><div>I wondered whether it would be of interest to include the functionality of this package within sklearn? </div><div>If you would consider extending the sklearn API to include these transformers, I would be happy to help.</div><div><br></div><div>Alternatively, would you consider adding the package to your website, where you list the libraries that extend sklearn functionality? 
<br></div><div><br></div><div>All feedback is welcome.<br></div><div><br></div><div><div>Many thanks and I look forward to hearing from you.</div><br class="m_-2787418305818367306gmail-m_-3188959306012726573gmail-Apple-interchange-newline"></div><div>Thank you so much for such an awesome contribution through the sklearn API.<br></div><div><br></div><div>Kind regards</div><div><br></div><div>Sole</div><div><br></div></div>
</blockquote></div></div>
</blockquote></div></div></div>
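<div dir="auto"><br></div><div dir="auto">For reference, the two points in the reply can be illustrated with a minimal sketch. It assumes scikit-learn 0.20 or later (where ColumnTransformer and SimpleImputer are available) together with pandas; the toy data, the column names "age" and "income", and the step names are made up for illustration and do not come from the thread.</div>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are assumptions for illustration.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "income": [1000.0, np.nan, 3000.0, 100000.0],
})
test = pd.DataFrame({
    "age": [np.nan, 90.0],
    "income": [np.nan, 5.0],
})

# Point 1: a different imputation strategy per column.
imputer = ColumnTransformer([
    ("age_mean", SimpleImputer(strategy="mean"), ["age"]),
    ("income_median", SimpleImputer(strategy="median"), ["income"]),
])

# Point 2: the fill values are learnt from the training set only...
imputer.fit(train)
age_fill = imputer.named_transformers_["age_mean"].statistics_[0]
income_fill = imputer.named_transformers_["income_median"].statistics_[0]

# ...and reused unchanged when transforming unseen data.
test_imputed = imputer.transform(test)

print("learnt fill values:", age_fill, income_fill)
print(test_imputed)
```

<div dir="auto">The missing values in the first test row are filled with the statistics stored during fit, not with values recomputed from the test set, which is the behaviour the original message was asking for.</div>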