[scikit-learn] One-hot encoding

Sarah Wait Zaranek sarah.zaranek at gmail.com
Fri Aug 3 08:20:10 EDT 2018


Hi all -

I can't use binary encoding because I need to trace back to the exact
categorical variable, and that is difficult with binary encoding, I
believe.  The number of categories varies per variable, but on average it
is about 10.  I return a sparse matrix from the encoder.  Regardless of
the encoding strategy, the issue is the overhead of the encoding step
itself, not the size of the resulting encoded matrix, so an encoding that
is slightly more compact isn't going to solve my problem as far as I am
aware.  We have run the tests with just the integer representation of the
categorical variables, and the results are unsatisfying.  If there is an
encoder that isn't lossy -- one that lets me get my original categories
back without as large a memory requirement as the one-hot creation -- I am
happy to try it out.
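
To make the lossless requirement concrete, here is a minimal round-trip
sketch (this assumes scikit-learn >= 0.20, where OneHotEncoder accepts
string categories and supports inverse_transform):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([["red", "small"],
                  ["blue", "large"],
                  ["red", "large"]])

    enc = OneHotEncoder(sparse=True)    # returns a scipy.sparse matrix
    X_enc = enc.fit_transform(X)

    # One-hot is lossless: every original category can be recovered.
    X_back = enc.inverse_transform(X_enc)
    assert (X_back == X).all()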

Yes, I need this many variables, and they are all categorical.  In my
world, short and wide matrices are very common.  Unfortunately, I need to
run feature selection techniques on the encoded version of the data.  I
could possibly do that in parts for some of the feature selection
techniques, but for the ones I really want to use I need the entire matrix
(think very large ReliefF).  I already have a working version on my
machine with less data (and by the way, my "machine" is one of the biggest
instances available in my region, with 400GB+ of RAM).  I am eventually
moving to a distributed computing solution (MLlib + Spark), but I wanted
to see what I could do in scikit-learn before I went there.  Of course, I
am aware of overfitting issues -- we do regularization, cross-validation,
etc.  I just thought it was unfortunate that the thing holding my analysis
back from using scikit-learn wasn't the machine learning but the memory
requirements of the encoding step.
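
For reference, since every row has exactly one active category per
variable, the encoded CSR matrix can in principle be assembled directly
from the integer category codes, keeping peak memory close to the size of
the final matrix.  A rough sketch (a hypothetical helper built on
scipy.sparse directly, not a scikit-learn API):

    import numpy as np
    import scipy.sparse as sp

    def onehot_from_codes(codes):
        # codes: (n_samples, n_features) array of integer category codes
        n, d = codes.shape
        n_cats = codes.max(axis=0) + 1             # categories per column
        offsets = np.concatenate(([0], np.cumsum(n_cats)[:-1]))
        indices = (codes + offsets).ravel()        # global column index
        indptr = np.arange(0, n * d + 1, d)        # d nonzeros per row
        data = np.ones(n * d, dtype=np.int8)
        return sp.csr_matrix((data, indices, indptr),
                             shape=(n, int(n_cats.sum())))

    codes = np.array([[0, 1], [2, 0], [1, 1]])
    X_enc = onehot_from_codes(codes)               # 3 x 5 sparse matrix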

Cheers,
Sarah

On Fri, Aug 3, 2018 at 7:52 AM, Fernando Marcos Wittmann <
fernando.wittmann at gmail.com> wrote:

> Hi Sarah, I have some reflection questions. You don't need to answer all
> of them :) How many categories (approximately) do you have in each of
> those 20M categorical variables? How many samples do you have? Maybe you
> should consider different encoding strategies, such as binary encoding.
> Also, this looks like a big-data problem. Have you considered using
> distributed computing? Also, do you really need to use all of those 20M
> variables in your first approach? Consider using feature selection
> techniques. I would suggest starting with something simpler, with fewer
> features, that runs more easily on your machine. Then you can start
> adding more complexity later if necessary. Keep in mind that if the
> number of samples is lower than the number of columns after one-hot
> encoding, you might face overfitting. Try to always have fewer columns
> than samples.
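>
> For example, a minimal binary-encoding sketch (this assumes the
> third-party category_encoders package, not scikit-learn itself):
>
>     import pandas as pd
>     import category_encoders as ce
>
>     df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
>     enc = ce.BinaryEncoder(cols=["color"])
>     df_enc = enc.fit_transform(df)
>     # Binary codes need roughly log2(n_categories) columns per
>     # variable, versus n_categories columns for one-hot encoding.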
>
> On Aug 2, 2018 12:53, "Sarah Wait Zaranek" <sarah.zaranek at gmail.com>
> wrote:
>
> Hi Joel -
>
> Are you sure?  I ran it and it actually uses a bit more memory instead
> of less -- same code, just run with a different Docker container.
>
> Max memory used by a single task: 50.41GB
> vs
> Max memory used by a single task: 51.15GB
>
> Cheers,
> Sarah
>
> On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek <
> sarah.zaranek at gmail.com> wrote:
>
>> In the developer version, yes? Looking for the new memory savings :)
>>
>> On Wed, Aug 1, 2018, 17:29 Joel Nothman <joel.nothman at gmail.com> wrote:
>>
>>> Use OneHotEncoder
>>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn