[scikit-learn] behaviour of OneHotEncoder somewhat confusing
Andreas Mueller
t3kcit at gmail.com
Thu Sep 22 01:08:02 EDT 2016
Yeah, the input format is a bit odd; usually it should be n_samples x
n_features, so something like
[['A'], ['C'], ['T'], ['G']]
Though this is currently also hard to do :(
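A minimal sketch of that single-column layout, assuming LabelEncoder is used to map characters to integers first (later scikit-learn releases can also consume the strings directly):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

myguide = 'ACGT'

# One character per sample, one feature: shape (n_samples, 1).
labels = LabelEncoder().fit_transform(list(myguide))  # array([0, 1, 2, 3])
column = labels.reshape(-1, 1)                        # shape (4, 1)

# With a single column of 4 distinct values, the encoder infers
# 4 categories and produces the expected 4 x 4 identity pattern.
onehot = OneHotEncoder().fit_transform(column).toarray()
print(onehot)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```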
On 09/20/2016 05:50 AM, Lee Zamparo wrote:
> Hi Joel,
>
> Yeah, it seems that one-hot encoding the transpose solves the
> issue. As you say, and as I mentioned to Sebastian, it seems a bit
> off-usage for OneHotEncoder.
>
> Thanks for the solution all the same though.
>
> --
> Lee Zamparo
>
> On September 19, 2016 at 7:48:15 PM, Joel Nothman
> (joel.nothman at gmail.com) wrote:
>
>> OneHotEncoder has issues, but I think all you want here is
>>
>> ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1))
>>
>> Still, this seems like it is far from the intended use of
>> OneHotEncoder (which should not really be stacked with LabelEncoder),
>> so it's not surprising it's tricky.
>>
>> On 20 September 2016 at 08:07, Sebastian Raschka
>> <se.raschka at gmail.com> wrote:
>>
>> Hi, Lee,
>>
>> maybe set `n_values=4`; this seems to do the job. I think the
>> problem you encountered is due to the fact that the one-hot
>> encoder infers the number of values for each feature (column)
>> from the dataset. In your case, each column had only 1 unique
>> value in your example:
>>
>> > array([[0, 1, 2, 3],
>> >        [0, 1, 2, 3],
>> >        [0, 1, 2, 3]])
>>
>> If you had an array like
>>
>> > array([[0],
>> >        [1],
>> >        [2],
>> >        [3]])
>>
>> it should work though. Alternatively, set n_values to 4:
>>
>>
>> > >>> from sklearn.preprocessing import OneHotEncoder
>> > >>> import numpy as np
>> >
>> > >>> enc = OneHotEncoder(n_values=4)
>> > >>> X = np.array([[0, 1, 2, 3]])
>> > >>> enc.fit_transform(X).toarray()
>>
>>
>> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>
>> and
>>
>> > X2 = np.array([[0, 1, 2, 3],
>> >                [0, 1, 2, 3],
>> >                [0, 1, 2, 3]])
>> >
>> > enc.transform(X2).toarray()
>>
>>
>>
>> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>
>>
>> Best,
>> Sebastian
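For readers on newer scikit-learn releases: the n_values parameter was later deprecated and removed. A sketch of the equivalent using the categories parameter, assuming four integer-coded features that each range over the same four categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# n_values is gone in recent scikit-learn; pass the per-feature
# category lists explicitly instead of a fixed count.
enc = OneHotEncoder(categories=[[0, 1, 2, 3]] * 4)

X = np.array([[0, 1, 2, 3]])
out = enc.fit_transform(X).toarray()
print(out.shape)  # (1, 16): 4 features x 4 categories each
```

This reproduces the 1 x 16 output shown above: each of the four features is expanded into four indicator columns.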
>>
>>
>> > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo at gmail.com> wrote:
>> >
>> > Hi sklearners,
>> >
>> > A lab-mate came to me with a problem about encoding DNA
>> sequences using preprocessing.OneHotEncoder, and I find that it
>> produces confusing results.
>> >
>> > Suppose I have a DNA string: myguide = 'ACGT'
>> >
>> > He'd like to use OneHotEncoder to transform DNA strings, character
>> by character, into a one-hot encoded representation like this:
>> [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use-case seems
>> to be solved in pandas using the dubiously named get_dummies
>> method
>> (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
>> I thought that it would be trivial to do with OneHotEncoder, but
>> it seems strangely difficult:
>> >
>> > In [23]: myarray = le.fit_transform([c for c in myguide])
>> >
>> > In [24]: myarray
>> > Out[24]: array([0, 1, 2, 3])
>> >
>> > In [27]: myarray = le.transform([[c for c in myguide],[c for c
>> in myguide],[c for c in myguide]])
>> >
>> > In [28]: myarray
>> > Out[28]:
>> > array([[0, 1, 2, 3],
>> >        [0, 1, 2, 3],
>> >        [0, 1, 2, 3]])
>> >
>> > In [29]: ohe.fit_transform(myarray)
>> > Out[29]:
>> > array([[ 1., 1., 1., 1.],
>> >        [ 1., 1., 1., 1.],
>> >        [ 1., 1., 1., 1.]]) <— ????
>> >
>> > So this is not at all what I expected. I read the
>> documentation for OneHotEncoder
>> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
>> but did not find it clear how it worked (also I found the example
>> using integers confusing). Neither FeatureHasher nor
>> DictVectorizer seems to be more appropriate for transforming
>> strings into positional one-hot encoded arrays. Am I missing
>> something, or is this operation not supported in sklearn?
>> >
>> > Thanks,
>> >
>> > --
>> > Lee Zamparo
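The all-ones output above is reproducible, and the explanation matches Sebastian's: each column of the 3 x 4 input holds a single distinct value, so the encoder infers one category per feature and emits one always-on indicator column per feature. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])

# Each of the 4 columns contains exactly one unique value, so each
# feature is encoded as a single indicator column that is always 1,
# giving a 3 x 4 matrix of ones rather than a one-hot pattern.
out = OneHotEncoder().fit_transform(X).toarray()
print(out)
# [[1. 1. 1. 1.]
#  [1. 1. 1. 1.]
#  [1. 1. 1. 1.]]
```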
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn