[scikit-learn] One-hot encoding
Sarah Wait Zaranek
sarah.zaranek at gmail.com
Wed Aug 1 15:11:35 EDT 2018
Hello,
I have installed the dev version (0.20.dev0). Should I just use
CategoricalEncoder, or is the functionality already rolled into
OneHotEncoder? I get the following message:
File "", line 1, in
File "/scikit-learn/sklearn/preprocessing/data.py", line 2839, in *init*
"CategoricalEncoder briefly existed in 0.20dev. Its functionality "
RuntimeError: CategoricalEncoder briefly existed in 0.20dev. Its
functionality has been rolled into the OneHotEncoder and OrdinalEncoder.
This stub will be removed in version 0.21.
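
So presumably I should just use the new OneHotEncoder directly. A minimal
sketch of what I understand the 0.20 usage to be (untested on my end;
categories='auto' opts in to the new behaviour):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

enc = OneHotEncoder(categories='auto')   # replaces CategoricalEncoder(encoding='onehot')
ord_enc = OrdinalEncoder()               # replaces CategoricalEncoder(encoding='ordinal')
X = [['red', 'S'], ['blue', 'M'], ['red', 'L']]
X_onehot = enc.fit_transform(X)          # sparse CSR output by default
X_ordinal = ord_enc.fit_transform(X)
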
Cheers,
Sarah
On Mon, Feb 5, 2018 at 10:46 PM, Sarah Wait Zaranek <sarah.zaranek at gmail.com
> wrote:
> Thanks, this makes sense. I will try using the CategoricalEncoder to see
> the difference. It wouldn't be such a big deal if my input matrix wasn't so
> large. Thanks again for all your help.
>
> Cheers,
> Sarah
>
> On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman
> <joel.nothman at gmail.com> wrote:
>
>> Yes, the output CSR representation requires:
>> 1 (dtype) value per entry
>> 1 int32 per entry
>> 1 int32 per row
>>
>> The intermediate COO representation requires:
>> 1 (dtype) value per entry
>> 2 int32 per entry
>>
>> So as long as the transformation from COO to CSR is done over the whole
>> data, the process will occupy roughly 5x the input size at its peak, which
>> is exactly what you are experiencing.
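>>
>> A back-of-envelope sketch of that accounting in code (assuming float64
>> values and int32 indices; scipy promotes the indices to int64 for very
>> large matrices, which makes this worse):
>>
>> def onehot_peak_bytes(n_rows, n_input_cols, value_bytes=8, index_bytes=4):
>>     # One-hot output has exactly one nonzero per input cell.
>>     nnz = n_rows * n_input_cols
>>     coo = nnz * (value_bytes + 2 * index_bytes)   # data + row + col
>>     csr = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
>>     # Both copies are alive while the COO -> CSR conversion runs.
>>     return coo + csr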
>>
>> The CategoricalEncoder currently available in the development version of
>> scikit-learn does not have this problem, but might be slower due to
>> handling non-integer categories. It will also possibly disappear and be
>> merged into OneHotEncoder soon (see PR #10523).
>>
>>
>>
>> On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
>> wrote:
>>
>>> Yes, of course. What I mean is that I start out with 19 Gigs (the
>>> initial matrix size) or so, it balloons to 100 Gigs *within the encoder
>>> function*, and it returns 28 Gigs (the sparse one-hot matrix size).
>>> These numbers aren't exact, but you can see my point.
>>>
>>> Cheers,
>>> Sarah
>>>
>>> On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.nothman at gmail.com>
>>> wrote:
>>>
>>>> OneHotEncoder will not magically reduce the size of your input. As long
>>>> as we are storing the results in scipy.sparse matrices, the output will
>>>> necessarily take more memory than the input data. The sparse
>>>> representation will be less expensive than the dense representation, but
>>>> it won't be less expensive than the input.
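>>>>
>>>> A quick toy check of that claim (a sketch; exact numbers depend on
>>>> dtype and platform):
>>>>
>>>> import numpy as np
>>>> from sklearn.preprocessing import OneHotEncoder
>>>>
>>>> X = np.random.randint(0, 10, size=(100000, 5))   # ~4 MB if int64
>>>> Xt = OneHotEncoder(sparse=True).fit_transform(X)  # CSR output
>>>> print(X.nbytes)
>>>> print(Xt.data.nbytes + Xt.indices.nbytes + Xt.indptr.nbytes)
>>>> # The sparse one-hot output is already larger than the dense input.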
>>>>
>>>> On 6 February 2018 at 13:24, Sarah Wait Zaranek <
>>>> sarah.zaranek at gmail.com> wrote:
>>>>
>>>>> Hi Joel -
>>>>>
>>>>> I am also seeing a huge overhead in memory when calling the
>>>>> one-hot encoder. I have worked around it by splitting my matrix into
>>>>> 4-5 smaller matrices (by columns) and then concatenating the results.
>>>>> But I am seeing upwards of 100 Gigs of overhead. Should I file a bug
>>>>> report? Or is this to be expected?
>>>>>
>>>>> Cheers,
>>>>> Sarah
>>>>>
>>>>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <
>>>>> sarah.zaranek at gmail.com> wrote:
>>>>>
>>>>>> Great. Thank you for all your help.
>>>>>>
>>>>>> Cheers,
>>>>>> Sarah
>>>>>>
>>>>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman at gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> If you specify n_values=[n_values_for_column1, n_values_for_column2,
>>>>>>> ...], i.e. the number of possible values for each column, you should
>>>>>>> be able to engineer it to what you want.
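>>>>>>>
>>>>>>> A minimal sketch with your example data (n_values per column =
>>>>>>> highest value + 1, i.e. [8, 3, 4], rather than the number of
>>>>>>> distinct levels):
>>>>>>>
>>>>>>> from sklearn.preprocessing import OneHotEncoder
>>>>>>>
>>>>>>> enc = OneHotEncoder(n_values=[8, 3, 4], sparse=False)
>>>>>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>> # 8 + 3 + 4 = 15 output columns: the 'auto' result plus all-zero
>>>>>>> # columns for values that never occur.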
>>>>>>>
>>>>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <
>>>>>>> sarah.zaranek at gmail.com> wrote:
>>>>>>>
>>>>>>>> If I use the n+1 approach, then I get the correct matrix, except
>>>>>>>> with extra columns of zeros:
>>>>>>>>
>>>>>>>> >>> test
>>>>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
>>>>>>>> sarah.zaranek at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Joel -
>>>>>>>>>
>>>>>>>>> Conceptually, that makes sense. But when I assign n_values, I
>>>>>>>>> can't make it match the result when you don't specify them. See
>>>>>>>>> below. I used the number of unique levels per column.
>>>>>>>>>
>>>>>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>>> >>> test
>>>>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>>> >>> test
>>>>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Sarah
>>>>>>>>>
>>>>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <
>>>>>>>>> joel.nothman at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> If each input column is encoded as a value from 0 to (the number
>>>>>>>>>> of possible values for that column - 1), then n_values for that
>>>>>>>>>> column should be the highest value + 1, which is also the number
>>>>>>>>>> of levels for that column. Does that make sense? In your example,
>>>>>>>>>> the first column contains a 7, so its entry must be 8 even though
>>>>>>>>>> only three distinct values occur; with n_values=[3, 2, 4] the
>>>>>>>>>> out-of-range indices spill into the blocks of the following
>>>>>>>>>> columns (and colliding entries are summed), which is presumably
>>>>>>>>>> why a 2. shows up in your output.
>>>>>>>>>>
>>>>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary
>>>>>>>>>> bit of code in the one-hot encoder: where the COO matrix is converted to
>>>>>>>>>> CSR. I suspect this was done because most of our ML algorithms perform
>>>>>>>>>> better on CSR, or else to maintain backwards compatibility with an earlier
>>>>>>>>>> implementation.
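>>>>>>>>>>
>>>>>>>>>> Roughly what happens internally (a sketch, not the actual encoder
>>>>>>>>>> code): one nonzero per input cell is collected as (row, col)
>>>>>>>>>> pairs, built as COO, then converted:
>>>>>>>>>>
>>>>>>>>>> import numpy as np
>>>>>>>>>> from scipy import sparse
>>>>>>>>>>
>>>>>>>>>> X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>>>> offsets = np.array([0, 8, 11])  # cumulative n_values [8, 3, 4]
>>>>>>>>>> rows = np.repeat(np.arange(X.shape[0]), X.shape[1])
>>>>>>>>>> cols = (X + offsets).ravel()    # shift each column into its block
>>>>>>>>>> data = np.ones(X.size)
>>>>>>>>>> out = sparse.coo_matrix((data, (rows, cols)), shape=(4, 15)).tocsr()
>>>>>>>>>> # .tocsr() allocates the CSR arrays while the COO arrays are
>>>>>>>>>> # still alive, hence the memory spike.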