Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?
Hello everyone,

I am frustrated with the one-hot-encoding requirement for categorical features. Why is it needed?

I've used R and Stata; neither needs such a transformation. They have a data type called "factor", which is distinct from "numeric".

My problem with OHE: one-hot encoding results in a large number of features. This blows up quickly, and then I have to fight the curse of dimensionality with PCA reduction. That's not cool!

Can sklearn have a "factor" data type in the future? It would make life so much easier.

Thanks a lot!
Hi,

I think there are several reasons that have led to the current situation.

One is that scikit-learn is built on numpy arrays, which do not offer a categorical data type (yet: ideas are being discussed, see https://numpy.org/neps/nep-0041-improved-dtype-support.html). Pandas already has a categorical data type: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html For algorithms like random forests, having native categorical variables would be absolutely great.

Another reason might be that different communities traditionally handle categorical data in different ways. One-hot encoding is more common on the ML side than on the stats side, for instance.

To your point:
One-hot-encoding results in large number of features. This really blows up quickly. And I have to fight curse of dimensionality with PCA reduction. That's not cool!
Depending on the algorithm being used, a categorical variable may or may not need to be expanded into a one-hot encoding under the hood, so the potential gain from having such a data type is highly dependent on the algorithms used.

Hope this helps!
Michael
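For reference, here is a minimal sketch (an editor's addition, not part of the original message) of the pandas categorical dtype mentioned above: values are stored as integer codes pointing into a small table of categories.

```python
import pandas as pd

# A column stored with the categorical dtype keeps integer codes
# plus the table of category labels.
s = pd.Series(["low", "high", "low", "medium"], dtype="category")

print(list(s.cat.categories))  # ['high', 'low', 'medium'] (sorted by default)
print(s.cat.codes.tolist())    # [1, 0, 1, 2]
```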
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
I've used R and Stata software, none needs such transformation. They have a data type called "factors", which is different from "numeric".
My problem with OHE: One-hot-encoding results in large number of features. This really blows up quickly. And I have to fight curse of dimensionality with PCA reduction. That's not cool!
Most statistical models still do one-hot encoding under the hood, so R and Stata do it too. Typically, tree-based models can be adapted to work directly on categorical data. Ours can't yet; it's work in progress.

G
Perhaps pd.factorize could help?
Hermes,

That's an interesting function. Does it work with sklearn after factorizing? Is there an example? Thanks!

On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <paisanohermes@hotmail.com> wrote:
Perhaps pd.factorize could help?
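For anyone following along, here is a minimal sketch (added by the editor, not part of the original reply) of what pd.factorize does: it maps each distinct value to an integer code, assigned in order of first appearance, and returns the codes together with the unique values.

```python
import pandas as pd

# pd.factorize returns (codes, uniques); codes are assigned in
# order of first appearance.
codes, uniques = pd.factorize(["b", "a", "b", "c"])

print(codes.tolist())  # [0, 1, 0, 2]
print(list(uniques))   # ['b', 'a', 'c']
```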
OrdinalEncoder is the scikit-learn equivalent of pd.factorize and will work in the scikit-learn ecosystem.

However, be aware that you should not simply swap OneHotEncoder for OrdinalEncoder at will; it depends on your machine learning pipeline. As mentioned by Gael, tree-based algorithms will be fine with OrdinalEncoder. If you have a linear model, then you need to use OneHotEncoder when the categories do not have any order.

I will refer you to a notebook that we taught at EuroSciPy last year:
https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/...
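To make the point above concrete, here is a hedged sketch (column names and toy data are invented for the example, not taken from the thread): OrdinalEncoder feeding a tree-based model, OneHotEncoder feeding a linear one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: one categorical column, one numeric column.
X = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"],
                  "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y = [0, 1, 0, 1, 1, 0]

# Trees only need a distinct integer per category.
tree_pipe = make_pipeline(
    ColumnTransformer([("cat", OrdinalEncoder(), ["color"])],
                      remainder="passthrough"),
    RandomForestClassifier(n_estimators=10, random_state=0))
tree_pipe.fit(X, y)

# Linear models need one column per (unordered) category.
linear_pipe = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(), ["color"])],
                      remainder="passthrough"),
    LogisticRegression())
linear_pipe.fit(X, y)

print(tree_pipe.predict(X).shape)  # (6,)
```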
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
Thank you for the link, Guillaume. In my particular case, I am working on random forest classification.

The notebook seems great; I will have to go through it in detail. I'm still fairly new to sklearn.

Thank you for everyone's quick responses, always feeling loved on here! :)
That's an excellent discussion! I've always wondered how other tools like R handle naturally categorical variables.

LightGBM has a scikit-learn API which handles categorical features if you pass their column names (or indices):

```
import lightgbm
lgb = lightgbm.LGBMClassifier()
lgb.fit(X, y, feature_name=..., categorical_feature=...)
```

Where:
- feature_name (list of strings or 'auto', optional (default='auto')): feature names. If 'auto' and the data is a pandas DataFrame, the column names are used.
- categorical_feature (list of strings or ints, or 'auto', optional (default='auto')): categorical features. If a list of ints, interpreted as indices; if a list of strings, interpreted as feature names (you need to specify feature_name as well). If 'auto' and the data is a pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than the int32 max value (2147483647).

As a suggestion, scikit-learn could add a `categorical_feature` parameter to the tree-based estimators so that they work the same way.
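As a side note (an editor's sketch, not part of the original message), the 'auto' mode described above amounts to picking out the DataFrame columns with a pandas categorical dtype, which you can also do by hand:

```python
import pandas as pd

# Mark one column as categorical; the numeric column stays as-is.
X = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red"]),
                  "size": [1.0, 2.0, 3.0]})

# Columns an 'auto' mode would treat as categorical:
cat_cols = [c for c in X.columns
            if isinstance(X[c].dtype, pd.CategoricalDtype)]

print(cat_cols)  # ['color']
```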
When it comes to trees, the API for handling categoricals is simpler than the implementation. Traditionally, tree-based models handle categorical variables in a way that differs from both ordinal and one-hot encoding, although either of those will work reasonably well for many problems. We are working on implementing categorical handling in trees (https://github.com/scikit-learn/scikit-learn/issues/15550, https://github.com/scikit-learn/scikit-learn/pull/12866)...
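To make the difference concrete, here is a toy sketch (an editor's addition): with ordinal codes a tree split is a threshold on the code, so only "prefix" partitions of the arbitrary category ordering are reachable, whereas native categorical handling can send any subset of categories down one branch.

```python
from itertools import combinations

cats = ["red", "green", "blue", "yellow"]

# Ordinal encoding: a split is "code <= t", so only contiguous
# prefixes of the (arbitrary) ordering can go left.
ordinal_splits = [(cats[:i], cats[i:]) for i in range(1, len(cats))]

# Native categorical handling: any nonempty proper subset can go left.
subset_splits = [set(s) for r in range(1, len(cats))
                 for s in combinations(cats, r)]

print(len(ordinal_splits))  # 3
print(len(subset_splits))   # 14  (2**4 - 2 nonempty proper subsets)
```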
participants (7)
- C W
- Fernando Marcos Wittmann
- Gael Varoquaux
- Guillaume Lemaître
- Hermes Morales
- Joel Nothman
- Michael Eickenberg