[scikit-learn] One-hot encoding

Joel Nothman joel.nothman at gmail.com
Mon Feb 5 00:56:25 EST 2018


If you specify n_values=[list_of_vals_for_column1,
list_of_vals_for_column2], you should be able to engineer the output you
want.
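
To make the column arithmetic in the quoted thread concrete, here is a minimal plain-Python sketch. It is not the scikit-learn implementation: one_hot and drop_empty_columns are hypothetical helpers written only for illustration, and the data is the 4x3 example from Sarah's message. It assumes each input column gets a block of n_values[j] output columns, with n_values[j] taken as the highest value + 1. Column 0 contains a 7, so it needs 8 slots rather than 3.

```python
# Pure-Python sketch of how per-column n_values map to output columns.
# one_hot and drop_empty_columns are hypothetical helpers, not scikit-learn API.

def one_hot(rows, n_values):
    """Encode integer rows; input column j occupies n_values[j] output slots."""
    # Starting offset of each input column's block in the output row.
    offsets = [0]
    for n in n_values:
        offsets.append(offsets[-1] + n)
    encoded = []
    for row in rows:
        out = [0.0] * offsets[-1]
        for j, value in enumerate(row):
            if not 0 <= value < n_values[j]:
                raise ValueError(f"value {value} out of range for column {j}")
            out[offsets[j] + value] = 1.0
        encoded.append(out)
    return encoded


def drop_empty_columns(matrix):
    """Keep only the output columns that are non-zero in at least one row."""
    keep = [k for k in range(len(matrix[0])) if any(r[k] for r in matrix)]
    return [[r[k] for k in keep] for r in matrix]


X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]
n_values = [max(col) + 1 for col in zip(*X)]  # highest value + 1 per column
full = one_hot(X, n_values)                   # 8 + 3 + 4 = 15 columns
compact = drop_empty_columns(full)            # all-zero columns removed
```

With this data, full has the 15-column layout from the n+1 approach, and compact has the same 9 columns as the result obtained without specifying n_values.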

On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
wrote:

> If I use the n+1 approach, then I get the correct matrix, except with the
> columns of zeros:
>
> >>> test
> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>
>
> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
> sarah.zaranek at gmail.com> wrote:
>
>> Hi Joel -
>>
>> Conceptually, that makes sense.  But when I assign n_values, I can't make
>> it match the result when you don't specify them. See below.  I used the
>> number of unique levels per column.
>>
>> >>> enc = OneHotEncoder(sparse=False)
>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>> >>> test
>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4])
>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>> >>> test
>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>>
>>> If each input column is encoded as a value from 0 to the (number of
>>> possible values for that column - 1) then n_values for that column should
>>> be the highest value + 1, which is also the number of levels per column.
>>> Does that make sense?
>>>
>>> Actually, I've realised there's a somewhat slow and unnecessary bit of
>>> code in the one-hot encoder: where the COO matrix is converted to CSR. I
>>> suspect this was done because most of our ML algorithms perform better on
>>> CSR, or else to maintain backwards compatibility with an earlier
>>> implementation.
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>