<div dir="ltr">OneHotCoder has issues, but I think all you want here is<div><br></div><div><span style="font-size:12.8px">ohe.fit_transform(np.transpose(</span><span style="font-size:12.8px">le.fit_transform([c for c in myguide])))</span><br style="font-size:12.8px"></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Still, this seems like it is far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising it's tricky.</span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 20 September 2016 at 08:07, Sebastian Raschka <span dir="ltr"><<a href="mailto:se.raschka@gmail.com" target="_blank">se.raschka@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, Lee,<br>
<br>
maybe set `n_values=4`; this seems to do the job. I think the problem you encountered is due to the fact that the one-hot encoder infers the number of values for each feature (column) from the dataset, and in your example each column has only one unique value:
<span class=""><br>
> array([[0, 1, 2, 3],<br>
> [0, 1, 2, 3],<br>
> [0, 1, 2, 3]])<br>
<br>
</span>If you had an array like<br>
<br>
> array([[0],<br>
> [1],<br>
> [2],<br>
> [3]])<br>
<br>
it should work though. Alternatively, set n_values to 4:<br>

> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
>
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()

array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])

and

> X2 = np.array([[0, 1, 2, 3],
>                [0, 1, 2, 3],
>                [0, 1, 2, 3]])
>
> enc.transform(X2).toarray()

array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])
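
And just to illustrate the single-column layout from above, a quick sketch (same imports as before, default n_values='auto'; X3 is just a throwaway name):

> >>> X3 = np.array([[0], [1], [2], [3]])
> >>> OneHotEncoder().fit_transform(X3).toarray()

which should give you the 4x4 pattern [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] directly.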

Best,
Sebastian

> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo@gmail.com> wrote:
>
> Hi sklearners,
>
> A lab-mate came to me with a problem about encoding DNA sequences using preprocessing.OneHotEncoder, and I find that it produces confusing results.
>
> Suppose I have a DNA string: myguide = 'ACGT'
>
> He'd like to use OneHotEncoder to transform DNA strings, character by character, into a one-hot encoded representation like this: [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems to be solved in pandas using the dubiously named get_dummies method (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html). I thought that it would be trivial to do with OneHotEncoder, but it seems strangely difficult:
>
> In [23]: myarray = le.fit_transform([c for c in myguide])
>
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
>
> In [27]: myarray = le.transform([[c for c in myguide], [c for c in myguide], [c for c in myguide]])
>
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])
>
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.]])   <— ????
>
> So this is not at all what I expected. I read the documentation for OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), but did not find it clear how it works (I also found the example using integers confusing). Neither FeatureHasher nor DictVectorizer seems more appropriate for transforming strings into positional one-hot encoded arrays. Am I missing something, or is this operation not supported in sklearn?
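>
> For comparison, the pandas version I had in mind is roughly this (just a sketch):
>
> >>> import pandas as pd
> >>> pd.get_dummies([c for c in myguide])
>
> which returns a DataFrame with one indicator column per letter, i.e. the [[1,0,0,0], ...] pattern above.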
>
> Thanks,
>
> --
> Lee Zamparo