[scikit-learn] behaviour of OneHotEncoder somewhat confusing
Andreas Mueller
t3kcit at gmail.com
Thu Sep 22 01:08:02 EDT 2016
Yeah, the input format is a bit odd; usually it should be n_samples x
n_features, so something like
[['A'], ['C'], ['T'], ['G']]
Though this is currently also hard to do :(
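A minimal sketch of that single-column layout, assuming LabelEncoder is used to map characters to integers first (later scikit-learn releases can also consume the strings directly):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

myguide = 'ACGT'

# One character per sample, one feature: shape (n_samples, 1).
labels = LabelEncoder().fit_transform(list(myguide))  # array([0, 1, 2, 3])
column = labels.reshape(-1, 1)                        # shape (4, 1)

# With a single column of 4 distinct values, the encoder infers
# 4 categories and produces the expected 4 x 4 identity pattern.
onehot = OneHotEncoder().fit_transform(column).toarray()
print(onehot)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```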
On 09/20/2016 05:50 AM, Lee Zamparo wrote:
> Hi Joel,
>
> Yeah, it seems that one-hot encoding the transpose solves the
> issue. As you say, and as I mentioned to Sebastian, it seems a bit
> off-usage for OneHotEncoder.
>
> Thanks for the solution all the same though.
>
> --
> Lee Zamparo
>
> On September 19, 2016 at 7:48:15 PM, Joel Nothman
> (joel.nothman at gmail.com) wrote:
>
>> OneHotEncoder has issues, but I think all you want here is
>>
>> ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1))
>>
>> Still, this seems like it is far from the intended use of
>> OneHotEncoder (which should not really be stacked with LabelEncoder),
>> so it's not surprising it's tricky.
>>
>> On 20 September 2016 at 08:07, Sebastian Raschka
>> <se.raschka at gmail.com> wrote:
>>
>> Hi, Lee,
>>
>> maybe set `n_values=4`; this seems to do the job. I think the
>> problem you encountered is due to the fact that the one-hot
>> encoder infers the number of values for each feature (column)
>> from the dataset. In your case, each column had only 1 unique
>> value in your example:
>>
>> > array([[0, 1, 2, 3],
>> >        [0, 1, 2, 3],
>> >        [0, 1, 2, 3]])
>>
>> If you had an array like
>>
>> > array([[0],
>> >        [1],
>> >        [2],
>> >        [3]])
>>
>> it should work though. Alternatively, set n_values to 4:
>>
>>
>> > >>> from sklearn.preprocessing import OneHotEncoder
>> > >>> import numpy as np
>> >
>> > >>> enc = OneHotEncoder(n_values=4)
>> > >>> X = np.array([[0, 1, 2, 3]])
>> > >>> enc.fit_transform(X).toarray()
>>
>>
>> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>
>> and
>>
>> > X2 = np.array([[0, 1, 2, 3],
>> >                [0, 1, 2, 3],
>> >                [0, 1, 2, 3]])
>> >
>> > enc.transform(X2).toarray()
>>
>>
>>
>> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>
>>
>> Best,
>> Sebastian
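For readers on newer scikit-learn releases: the n_values parameter was later deprecated and removed. A sketch of the equivalent using the categories parameter, assuming four integer-coded features that each range over the same four categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# n_values is gone in recent scikit-learn; pass the per-feature
# category lists explicitly instead of a fixed count.
enc = OneHotEncoder(categories=[[0, 1, 2, 3]] * 4)

X = np.array([[0, 1, 2, 3]])
out = enc.fit_transform(X).toarray()
print(out.shape)  # (1, 16): 4 features x 4 categories each
```

This reproduces the 1 x 16 output shown above: each of the four features is expanded into four indicator columns.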
>>
>>
>> > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo at gmail.com> wrote:
>> >
>> > Hi sklearners,
>> >
>> > A lab-mate came to me with a problem about encoding DNA
>> sequences using preprocessing.OneHotEncoder, and I find that it
>> produces confusing results.
>> >
>> > Suppose I have a DNA string: myguide = 'ACGT'
>> >
>> > He'd like to use OneHotEncoder to transform DNA strings, character
>> by character, into a one-hot encoded representation like this:
>> [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use-case seems
>> to be solved in pandas using the dubiously named get_dummies
>> method
>> (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
>> I thought that it would be trivial to do with OneHotEncoder, but
>> it seems strangely difficult:
>> >
>> > In [23]: myarray = le.fit_transform([c for c in myguide])
>> >
>> > In [24]: myarray
>> > Out[24]: array([0, 1, 2, 3])
>> >
>> > In [27]: myarray = le.transform([[c for c in myguide],[c for c
>> in myguide],[c for c in myguide]])
>> >
>> > In [28]: myarray
>> > Out[28]:
>> > array([[0, 1, 2, 3],
>> >        [0, 1, 2, 3],
>> >        [0, 1, 2, 3]])
>> >
>> > In [29]: ohe.fit_transform(myarray)
>> > Out[29]:
>> > array([[ 1., 1., 1., 1.],
>> >        [ 1., 1., 1., 1.],
>> >        [ 1., 1., 1., 1.]]) <— ????
>> >
>> > So this is not at all what I expected. I read the
>> documentation for OneHotEncoder
>> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
>> but did not find it clear how it worked (also I found the example
>> using integers confusing). Neither FeatureHasher nor
>> DictVectorizer seems to be more appropriate for transforming
>> strings into positional one-hot encoded arrays. Am I missing
>> something, or is this operation not supported in sklearn?
>> >
>> > Thanks,
>> >
>> > --
>> > Lee Zamparo
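The all-ones output above is reproducible, and the explanation matches Sebastian's: each column of the 3 x 4 input holds a single distinct value, so the encoder infers one category per feature and emits one always-on indicator column per feature. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])

# Each of the 4 columns contains exactly one unique value, so each
# feature is encoded as a single indicator column that is always 1,
# giving a 3 x 4 matrix of ones rather than a one-hot pattern.
out = OneHotEncoder().fit_transform(X).toarray()
print(out)
# [[1. 1. 1. 1.]
#  [1. 1. 1. 1.]
#  [1. 1. 1. 1.]]
```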
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn