[scikit-learn] One-hot encoding

Sarah Wait Zaranek sarah.zaranek at gmail.com
Mon Feb 5 21:53:21 EST 2018


Yes, of course.  What I mean is that I start out with 19 Gigs (initial
matrix size) or so, it balloons to 100 Gigs *within the encoder function*
and returns 28 Gigs (sparse one-hot matrix size).  These numbers aren't
exact, but you can see my point.

Cheers,
Sarah

On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.nothman at gmail.com> wrote:

> OneHotEncoder will not magically reduce the size of your input. It will
> necessarily increase the memory of the input data as long as we are storing
> the results in scipy.sparse matrices. The sparse representation will be
> less expensive than the dense representation, but it won't be less
> expensive than the input.
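
A back-of-the-envelope sketch of the point above, under stated assumptions: the input is a dense array with 8-byte items, and the one-hot output is a CSR matrix with 8-byte data values and 4-byte indices (very large matrices may need 8-byte indices instead). The function names are hypothetical, for illustration only. Every input cell becomes exactly one stored nonzero, so the sparse output cannot be smaller than the input under these assumptions.

```python
def dense_input_bytes(n_samples, n_features, item_bytes=8):
    """Bytes used by a dense (n_samples, n_features) array."""
    return n_samples * n_features * item_bytes


def one_hot_csr_bytes(n_samples, n_features, data_bytes=8, index_bytes=4):
    """Approximate bytes used by the CSR one-hot encoding of that array.

    One-hot encoding stores exactly one nonzero per input cell, and CSR
    keeps a data value plus a column index for each nonzero, plus a row
    pointer array of length n_samples + 1.
    """
    nnz = n_samples * n_features
    return nnz * (data_bytes + index_bytes) + (n_samples + 1) * index_bytes


# Hypothetical sizes, chosen only to show the ratio.
n_samples, n_features = 1_000_000, 100
ratio = one_hot_csr_bytes(n_samples, n_features) / dense_input_bytes(n_samples, n_features)
print(f"sparse one-hot / dense input = {ratio:.2f}")
```

With these assumed item sizes the ratio is about 1.5, which is in the same ballpark as the 19 Gigs in and 28 Gigs out mentioned in this thread, although the actual dtypes in that run are not stated.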
>
> On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
> wrote:
>
>> Hi Joel -
>>
>> I am also seeing a huge memory overhead when calling the
>> one-hot encoder.  I have hacked around it by splitting my matrix into
>> 4-5 smaller matrices (by columns), encoding each one separately, and
>> then concatenating the results.  But I am still seeing upwards of
>> 100 Gigs of overhead.  Should I file a bug report?  Or is this to be
>> expected?
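
The column-splitting workaround described above might look roughly like the sketch below. The helper name `one_hot_by_column_blocks` and the `n_blocks` parameter are hypothetical, and the encoder is used with its default settings (sparse output, categories inferred from the data). Whether this lowers peak memory in a given setup has not been measured here.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder


def one_hot_by_column_blocks(X, n_blocks=4):
    """One-hot encode X one block of columns at a time.

    Each block gets its own encoder, so only a slice of the input is
    handled per call; the sparse pieces are stacked side by side.
    """
    column_blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    encoded = [
        OneHotEncoder().fit_transform(X[:, cols])
        for cols in column_blocks
        if len(cols) > 0  # skip empty blocks when n_blocks > n_columns
    ]
    return sparse.hstack(encoded, format="csr")


# Small integer matrix taken from the example later in this thread.
X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
chunked = one_hot_by_column_blocks(X, n_blocks=2)
full = OneHotEncoder().fit_transform(X)
```

Because the blocks are contiguous and kept in order, the stacked result has the same column layout as encoding the whole matrix at once.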
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <
>> sarah.zaranek at gmail.com> wrote:
>>
>>> Great.  Thank you for all your help.
>>>
>>> Cheers,
>>> Sarah
>>>
>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>>> If you specify n_values=[list_of_vals_for_column1,
>>>> list_of_vals_for_column2], you should be able to engineer it to how you
>>>> want.
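
For readers on scikit-learn 0.20 or later, where `n_values` was deprecated and later removed, a similar effect can be had with the `categories` parameter, which takes one list of allowed values per column. This is a sketch using that newer parameter, not the `n_values` call discussed in this thread; the data is the small matrix from the example quoted below.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])

# One sorted list of allowed values per input column.
categories = [[0, 1, 7], [0, 2], [0, 1, 2, 3]]

enc = OneHotEncoder(categories=categories)
encoded = enc.fit_transform(X).toarray()  # dense copy, for display only
print(encoded)
```

Listing a value that never occurs in the data reserves an all-zero column for it, which is one way to control the output layout explicitly.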
>>>>
>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <
>>>> sarah.zaranek at gmail.com> wrote:
>>>>
>>>>> If I use the n+1 approach, then I get the correct matrix, except with
>>>>> the columns of zeros:
>>>>>
>>>>> >>> test
>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>>
>>>>>
>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
>>>>> sarah.zaranek at gmail.com> wrote:
>>>>>
>>>>>> Hi Joel -
>>>>>>
>>>>>> Conceptually, that makes sense.  But when I assign n_values, I can't
>>>>>> make it match the result when you don't specify them. See below.  I used
>>>>>> the number of unique levels per column.
>>>>>>
>>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>> >>> test
>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4])
>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>> >>> test
>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>
>>>>>> Cheers,
>>>>>> Sarah
>>>>>>
>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.nothman at gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> If each input column is encoded as a value from 0 to the (number of
>>>>>>> possible values for that column - 1) then n_values for that column should
>>>>>>> be the highest value + 1, which is also the number of levels per column.
>>>>>>> Does that make sense?
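
The "highest value + 1" rule can be checked by hand with plain NumPy. This is an illustrative sketch, not the encoder's own code: it builds the dense one-hot matrix directly and then drops the columns that are never used. The variable names are made up for the example.

```python
import numpy as np

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])

# Highest value + 1 per column: 8, 3 and 4 slots, 15 output columns in total.
n_values = X.max(axis=0) + 1

# First output column belonging to each input column: 0, 8 and 11.
offsets = np.concatenate(([0], np.cumsum(n_values)[:-1]))

# Set one entry per input cell, at offset + value.
dense = np.zeros((X.shape[0], n_values.sum()))
dense[np.arange(X.shape[0])[:, None], X + offsets] = 1.0

# Keep only the columns that are hit at least once.
used = dense.any(axis=0)
compact = dense[:, used]
print(dense.shape, compact.shape)
```

For this matrix the first column only takes the values 0, 1 and 7, so slots 2 to 6 stay empty. That is where the columns of zeros in the 15-column result come from, and dropping them leaves the 9-column layout produced when `n_values` is not given.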
>>>>>>>
>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit
>>>>>>> of code in the one-hot encoder: where the COO matrix is converted to CSR. I
>>>>>>> suspect this was done because most of our ML algorithms perform better on
>>>>>>> CSR, or else to maintain backwards compatibility with an earlier
>>>>>>> implementation.
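
To make the COO-to-CSR step concrete, here is a small sketch with `scipy.sparse`. It is an assumption about the general shape of such code, not a copy of the scikit-learn source: the one-hot matrix is assembled from (row, column) pairs in COO form and then converted.

```python
import numpy as np
from scipy import sparse

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
n_samples, n_features = X.shape

n_values = X.max(axis=0) + 1
offsets = np.concatenate(([0], np.cumsum(n_values)[:-1]))

# COO form: one (row, col, 1.0) triple per input cell.
rows = np.repeat(np.arange(n_samples), n_features)
cols = (X + offsets).ravel()
data = np.ones(n_samples * n_features)
coo = sparse.coo_matrix((data, (rows, cols)), shape=(n_samples, n_values.sum()))

# The conversion allocates new CSR arrays while the COO arrays still exist.
csr = coo.tocsr()
print(coo.nnz, csr.shape)
```

While `tocsr()` runs, both representations are alive at the same time, so a conversion like this adds to peak memory on top of the final result.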
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> scikit-learn mailing list
>>>>>>> scikit-learn at python.org
>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn