[scikit-learn] One-hot encoding

Fernando Marcos Wittmann fernando.wittmann at gmail.com
Fri Aug 3 07:52:53 EDT 2018


Hi Sarah, I have a few questions to reflect on. You don't need to answer
all of them :) Roughly how many categories do you have in each of those
20M categorical variables? How many samples do you have? You might
consider a different encoding strategy, such as binary encoding. This also
looks like a big data problem. Have you considered using distributed
computing?

Also, do you really need all of those 20M variables in your first
approach? Consider using feature selection techniques. I would suggest
starting with something simpler, with fewer features, that runs more
easily on your machine. You can then add more complexity later if
necessary.

Keep in mind that if the number of samples is lower than the number of
columns after one-hot encoding, you might face overfitting. Try to always
have fewer columns than samples.
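
To make the binary encoding suggestion concrete, here is a minimal sketch
in plain Python. It assumes the usual definition: each category gets an
integer code, and the code is written out as bits, so k categories need
about log2(k) columns instead of the k columns one-hot encoding produces.
The function name `binary_encode` and the sample data are hypothetical,
for illustration only; this is not a scikit-learn API.

```python
def binary_encode(values):
    """Encode categorical values as rows of bits (illustrative sketch).

    Returns (rows, categories), where rows[i] is the bit list for
    values[i] and categories is the sorted list of distinct values.
    """
    # Assign each distinct category an integer code in sorted order.
    categories = sorted(set(values))
    code_of = {cat: i for i, cat in enumerate(categories)}

    # Bits needed for the largest code; use at least one column.
    n_bits = max(1, (len(categories) - 1).bit_length())

    # Write each code as bits, most significant bit first.
    rows = [
        [(code_of[v] >> bit) & 1 for bit in reversed(range(n_bits))]
        for v in values
    ]
    return rows, categories


# Hypothetical sample column with 5 distinct categories.
colors = ["red", "green", "blue", "green", "yellow", "purple"]
encoded, categories = binary_encode(colors)

# One-hot would need len(categories) columns; binary needs len(encoded[0]).
print(len(categories), "one-hot columns vs", len(encoded[0]), "binary columns")
```

The trade-off is that the bit columns no longer correspond to individual
categories, so whether this helps depends on the model you fit afterwards.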

On Aug 2, 2018 12:53, "Sarah Wait Zaranek" <sarah.zaranek at gmail.com> wrote:

Hi Joel -

Are you sure? I ran it and it actually uses a bit more memory instead of
less. It is the same code, just run with a different Docker container.

Max memory used by a single task: 50.41GB
vs
Max memory used by a single task: 51.15GB

Cheers,
Sarah

On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
wrote:

> In the developer version, yes? Looking for the new memory savings :)
>
> On Wed, Aug 1, 2018, 17:29 Joel Nothman <joel.nothman at gmail.com> wrote:
>
>> Use OneHotEncoder
>>
>