[scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?

C W tmrsg11 at gmail.com
Fri May 1 11:53:45 EDT 2020


Thank you for the link, Guillaume. In my particular case, I am working on
random forest classification.

The notebook seems great. I will have to go through it in detail. I'm still
fairly new at using sklearn.
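
If I understand the suggestion correctly, for my random forest case it would
look roughly like the sketch below. The toy data and column names are made
up, and I am assuming ColumnTransformer is a reasonable way to encode only the
categorical columns; please correct me if this is not the intended usage.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration only.
X = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "M", "S", "L"],
    "weight": [1.0, 2.5, 3.1, 2.4, 0.9, 3.3],
})
y = [0, 1, 1, 1, 0, 1]

# Integer-encode the two categorical columns; pass "weight" through unchanged.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["color", "size"])],
    remainder="passthrough",
)

# Chain the encoding and the forest so both are fitted together.
model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)

# The encoded matrix keeps one column per original feature.
X_encoded = preprocess.fit_transform(X)
print(X_encoded.shape)
```

One thing this sketch does not handle: by default OrdinalEncoder raises an
error on categories it did not see during fit.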

Thanks to everyone for the quick responses, always feeling loved on here! :)



On Fri, May 1, 2020 at 4:00 AM Guillaume Lemaître <g.lemaitre58 at gmail.com>
wrote:

> OrdinalEncoder is the equivalent of pd.factorize and will work in the
> scikit-learn ecosystem.
>
> However, be aware that you should not swap OneHotEncoder for
> OrdinalEncoder at will.
> It depends on your machine learning pipeline.
>
> As mentioned by Gael, tree-based algorithms will be fine with
> OrdinalEncoder. If you have a linear model,
> then you need to use the OneHotEncoder if the categories do not have any
> order.
>
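
To check my understanding of the difference, here is a small sketch on a
made-up "city" column (the data is hypothetical). It only compares the shape
of the two encodings and the integer codes from pd.factorize; as far as I can
tell, a linear model would read the ordinal codes as ordered magnitudes, which
is why one-hot encoding is needed there when the categories have no order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up column with four unordered categories.
cities = pd.DataFrame(
    {"city": ["Paris", "Lyon", "Nice", "Lille", "Paris", "Nice"]}
)

# OrdinalEncoder: a single column of integer codes (stored as floats).
ordinal = OrdinalEncoder().fit_transform(cities)

# OneHotEncoder: one indicator column per category; sparse by default.
one_hot = OneHotEncoder().fit_transform(cities).toarray()

ordinal_shape = ordinal.shape
one_hot_shape = one_hot.shape
print(ordinal_shape, one_hot_shape)

# pd.factorize also yields integer codes, but by default it numbers the
# categories in order of appearance, while OrdinalEncoder numbers the
# sorted categories.
codes, uniques = pd.factorize(cities["city"])
print(list(codes), list(uniques))
```
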
> I will just refer to one notebook that we taught at EuroSciPy last year:
>
> https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb
>
> On Fri, 1 May 2020 at 05:11, C W <tmrsg11 at gmail.com> wrote:
>
>> Hermes,
>>
>> That's an interesting function. Does it work with sklearn after
>> factorize?  Is there any example? Thanks!
>>
>> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <paisanohermes at hotmail.com>
>> wrote:
>>
>>> Perhaps pd.factorize could help?
>>>
>>> ------------------------------
>>> *From:* scikit-learn <scikit-learn-bounces+paisanohermes=
>>> hotmail.com at python.org> on behalf of Gael Varoquaux <
>>> gael.varoquaux at normalesup.org>
>>> *Sent:* Thursday, April 30, 2020 5:12:06 PM
>>> *To:* Scikit-learn mailing list <scikit-learn at python.org>
>>> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding
>>> for categorical features? Can we have a "factor" data type?
>>>
>>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
>>> > I've used R and Stata, and neither needs such a transformation. They
>>> > have a data type called "factors", which is different from "numeric".
>>>
>>> > My problem with OHE:
>>> > One-hot encoding results in a large number of features. This really
>>> > blows up quickly, and I have to fight the curse of dimensionality with
>>> > PCA reduction. That's not cool!
>>>
>>> Most statistical models still do one-hot encoding under the hood. So R
>>> and Stata do it too.
>>>
>>> Typically, tree-based models can be adapted to work directly on
>>> categorical data. Ours don't. It's work in progress.
>>>
>>> G
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/