[Feature] drop_one in one hot encoder
Hy Sci-kittens! :-)

I was taking the Machine Learning A-Z course on Udemy, where they said that whenever one-hot encoding is done, one of the columns should be dropped, because it repeats information the other columns already carry and is redundant to the model. Instead of having the user find the index and drop the column after preprocessing, what if OneHotEncoder had a drop_one parameter that automatically removed the last column? What are your thoughts on this? I am new to this community and would like to contribute this myself if it is a possible addition.

Thanks,
Trion129
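To make the redundancy argument concrete, here is a minimal sketch in plain NumPy. The `one_hot` helper and its `drop_last` flag are hypothetical names invented for this illustration; they are not part of scikit-learn and only mimic what the proposed drop_one option might do.

```python
import numpy as np

def one_hot(labels, drop_last=False):
    """Hypothetical helper: one-hot encode labels, optionally dropping the last column."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    if drop_last:
        # Keep every column except the one for the last category.
        encoded = encoded[:, :-1]
    return categories, encoded

labels = ["red", "green", "blue", "green"]
categories, full = one_hot(labels)
_, reduced = one_hot(labels, drop_last=True)

# Each row of the full encoding sums to 1, so the dropped column can be
# rebuilt from the remaining ones. That is the sense in which it is redundant.
recovered = 1 - reduced.sum(axis=1)
print(np.array_equal(recovered, full[:, -1]))
```

Because the dropped column can always be rebuilt from the remaining ones, removing it loses no information about the category.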
Hi,

hm, I think that dropping a column from one-hot encoded features is quite uncommon in machine learning practice, based on the applications and implementations I've seen. My guess is that the one-hot encoded features are multicollinear anyway!? There may be certain algorithms that benefit from dropping a column, though (linear regression is a simple example). For instance, pandas' get_dummies has a "drop_first" parameter ... I think it would make sense to have such a parameter in the OneHotEncoder as well, e.g., for working with pipelines.

Best,
Sebastian
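For reference, a short sketch of the pandas behaviour mentioned above. The toy DataFrame and its "color" column are made up for illustration; `get_dummies` and its `drop_first` parameter are the existing pandas API.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(df, columns=["color"], dtype=int)

# drop_first=True removes the indicator for the first category (in sorted order).
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Note that pandas drops the first category, whereas the proposal in this thread was to drop the last one. Either choice removes the same redundancy.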
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
On Sun, Jun 25, 2017 at 05:18:09PM +0530, Parminder Singh wrote:
> Hy Sci-kittens! :-)

Nice :).

FYI: there is work in progress to replace the OneHotEncoder, as it has many strong limitations:
https://github.com/scikit-learn/scikit-learn/pull/9151

It might be useful to have a look at this PR to make sure that it solves the various use cases.

Gaël
-- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
participants (3)
- Gael Varoquaux
- Parminder Singh
- Sebastian Raschka