[scikit-learn] behaviour of OneHotEncoder somewhat confusing

Mon Sep 19 20:20:31 EDT 2016

Hi Joel,

Yea, seems that the one-hot encoding of the transpose solves the issue.  As
you say, and as I mentioned to Sebastian, it seems a bit off-usage for
OneHotEncoder.

Thanks for the solution all the same though.

-- 
Lee Zamparo

On September 19, 2016 at 7:48:15 PM, Joel Nothman (joel.nothman at gmail.com)
wrote:

OneHotEncoder has issues, but I think all you want here is

ohe.fit_transform(np.transpose([le.fit_transform([c for c in myguide])]))

Still, this seems like it is far from the intended use of OneHotEncoder
(which should not really be stacked with LabelEncoder), so it's not
surprising it's tricky.
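
To make the shape issue concrete, here is a minimal library-free sketch. It assumes only that the encoder treats every column as a separate feature and infers that feature's categories from the data it is fitted on. The helper `one_hot_per_column` is hypothetical and written for illustration; it is not scikit-learn's implementation.

```python
def one_hot_per_column(rows):
    """One-hot encode a 2-D list of labels, inferring categories per column.

    Hypothetical helper for illustration only; it mimics per-column
    category inference and is not scikit-learn's actual code.
    """
    n_cols = len(rows[0])
    # Categories are inferred separately for each column (feature).
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            # One indicator slot per category seen in column j.
            out.extend(1.0 if value == cat else 0.0 for cat in categories[j])
        encoded.append(out)
    return encoded


guide = "ACGT"

# Three identical sequences as rows: every column holds a single distinct
# value, so every column contributes exactly one output slot, always 1.
as_rows = one_hot_per_column([list(guide)] * 3)

# One sequence as a single column: that column holds four distinct values,
# so each position becomes its own 4-wide indicator row.
as_column = one_hot_per_column([[c] for c in guide])

print(as_rows)
print(as_column)
```

With the sequence laid out as a column, the rows come out as the expected indicator vectors; laid out as rows of identical sequences, every entry is 1 because no column ever sees more than one value.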

On 20 September 2016 at 08:07, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Hi, Lee,
>
> maybe set `n_values=4`; this seems to do the job. I think the problem you
> encountered is due to the fact that the one-hot encoder infers the number
> of values for each feature (column) from the dataset. In your example, each
> column had only 1 unique value:
>
> > array([[0, 1, 2, 3],
> >        [0, 1, 2, 3],
> >        [0, 1, 2, 3]])
>
> If you had an array like
>
> > array([[0],
> >        [1],
> >        [2],
> >        [3]])
>
> it should work though. Alternatively, set n_values to 4:
>
>
> > >>> from sklearn.preprocessing import OneHotEncoder
> > >>> import numpy as np
> >
> > >>> enc = OneHotEncoder(n_values=4)
> > >>> X = np.array([[0, 1, 2, 3]])
> > >>> enc.fit_transform(X).toarray()
>
>
> array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
>          0.,  0.,  1.]])
>
> and
>
> > X2 = np.array([[0, 1, 2, 3],
> >                [0, 1, 2, 3],
> >                [0, 1, 2, 3]])
> >
> > enc.transform(X2).toarray()
>
>
>
> array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
>          0.,  0.,  1.],
>        [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
>          0.,  0.,  1.],
>        [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
>          0.,  0.,  1.]])
>
>
> Best,
> Sebastian
>
>
> > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo at gmail.com> wrote:
> >
> > Hi sklearners,
> >
> > A lab-mate came to me with a problem about encoding DNA sequences using
> > preprocessing.OneHotEncoder, and I find it to produce confusing results.
> >
> > Suppose I have a DNA string:  myguide = 'ACGT'
> >
> > He’d like to use OneHotEncoder to transform DNA strings, character by
> > character, into a one-hot encoded representation like this: [[1,0,0,0],
> > [0,1,0,0], [0,0,1,0], [0,0,0,1]].  The use-case seems to be solved in
> > pandas using the dubiously named get_dummies method
> > (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
> > I thought that it would be trivial to do with OneHotEncoder, but it seems
> > strangely difficult:
> >
> > In [23]: myarray = le.fit_transform([c for c in myguide])
> >
> > In [24]: myarray
> > Out[24]: array([0, 1, 2, 3])
> >
> > In [27]: myarray = le.transform([[c for c in myguide],[c for c in myguide],[c for c in myguide]])
> >
> > In [28]: myarray
> > Out[28]:
> > array([[0, 1, 2, 3],
> >        [0, 1, 2, 3],
> >        [0, 1, 2, 3]])
> >
> > In [29]: ohe.fit_transform(myarray)
> > Out[29]:
> > array([[ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.]])    <-- ????
> >
> > So this is not at all what I expected.  I read the documentation for
> > OneHotEncoder
> > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
> > but did not find it clear how it worked (also I found the example using
> > integers confusing).  Neither FeatureHasher nor DictVectorizer seems to be
> > more appropriate for transforming strings into positional one-hot encoded
> > arrays.  Am I missing something, or is this operation not supported in
> > sklearn?
> >
> > Thanks,
> >
> > --
> > Lee Zamparo
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>