[scikit-learn] One-hot encoding

Joel Nothman joel.nothman at gmail.com
Sun Feb 4 23:27:37 EST 2018


20 million categories, or 20 million categorical variables?

OneHotEncoder is pretty efficient if you specify n_values.
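
To illustrate why knowing the number of values per feature helps, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function name `one_hot_csr` and its return layout are hypothetical, and it assumes every feature is already coded as integers in `range(n_values[j])`. With the counts known up front, each output column is simply `offset + value`, so the sparse result can be built in a single pass without first scanning the data to discover which values occur.

```python
from typing import List, Sequence, Tuple


def one_hot_csr(
    rows: Sequence[Sequence[int]], n_values: Sequence[int]
) -> Tuple[List[int], List[int], List[int], int]:
    """Hypothetical helper: one-hot encode integer-coded rows into CSR arrays.

    Returns (data, indices, indptr, n_columns). Assumes feature j only
    takes values in range(n_values[j]).
    """
    # offsets[j] is the first output column reserved for feature j.
    offsets = [0]
    for n in n_values:
        offsets.append(offsets[-1] + n)

    data: List[int] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for j, value in enumerate(row):
            if not 0 <= value < n_values[j]:
                raise ValueError(f"feature {j}: value {value} out of range")
            # Column index is computed directly; no lookup table is needed.
            indices.append(offsets[j] + value)
            data.append(1)
        indptr.append(len(indices))
    return data, indices, indptr, offsets[-1]


# Two features: the first takes 2 distinct values, the second takes 3.
rows = [[0, 2], [1, 0], [0, 1]]
data, indices, indptr, n_columns = one_hot_csr(rows, n_values=[2, 3])
```

Each row contributes exactly one nonzero per feature, so memory grows with the number of rows times the number of features, not with the total number of output columns.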

On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
wrote:

> Hello -
>
> I was just wondering if there was a way to improve performance on the
> one-hot encoder.  Or, are there any plans to do so in the future?  I am
> working with a matrix that will ultimately have 20 million categorical
> variables, and my bottleneck is the one-hot encoder.
>
> Let me know if this isn't the place to inquire.  My code is very simple
> when using the encoder, but I cut and pasted it here for completeness.
>
>     from sklearn.preprocessing import OneHotEncoder
>
>     enc = OneHotEncoder(sparse=True)
>     Xtrain = enc.fit_transform(tiledata)
>
>
> Thanks,
> Sarah