<div dir="ltr">OneHotCoder has issues, but I think all you want here is<div><br></div><div><span style="font-size:12.8px">ohe.fit_transform(np.transpose(</span><span style="font-size:12.8px">le.fit_transform([c for c in myguide])))</span><br style="font-size:12.8px"></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Still, this seems like it is far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising it's tricky.</span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 20 September 2016 at 08:07, Sebastian Raschka <span dir="ltr"><<a href="mailto:se.raschka@gmail.com" target="_blank">se.raschka@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, Lee,<br>
<br>
maybe set `n_values=4`; this seems to do the job. I think the problem you encountered is due to the fact that the one-hot encoder infers the number of values for each feature (column) from the dataset, and in your example each column has only one unique value:
<span class=""><br>
> array([[0, 1, 2, 3],<br>
> [0, 1, 2, 3],<br>
> [0, 1, 2, 3]])<br>
<br>
</span>If you had an array like<br>
<br>
> array([[0],<br>
> [1],<br>
> [2],<br>
> [3]])<br>
<br>
it should work though. Alternatively, set n_values to 4:<br>

> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
>
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()

array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])

and

> X2 = np.array([[0, 1, 2, 3],
>                [0, 1, 2, 3],
>                [0, 1, 2, 3]])
>
> enc.transform(X2).toarray()

array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])
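
And just to illustrate the single-column layout from above, a quick sketch (same imports as before, default n_values='auto'; X3 is just a throwaway name):

> >>> X3 = np.array([[0], [1], [2], [3]])
> >>> OneHotEncoder().fit_transform(X3).toarray()

which should give you the 4x4 pattern [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]] directly.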

Best,
Sebastian

> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo@gmail.com> wrote:
>
> Hi sklearners,
>
> A lab-mate came to me with a problem about encoding DNA sequences using preprocessing.OneHotEncoder, and I find that it produces confusing results.
>
> Suppose I have a DNA string: myguide = 'ACGT'
>
> He'd like to use OneHotEncoder to transform DNA strings, character by character, into a one-hot encoded representation like this: [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems to be solved in pandas using the dubiously named get_dummies method (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html). I thought that it would be trivial to do with OneHotEncoder, but it seems strangely difficult:
>
> In [23]: myarray = le.fit_transform([c for c in myguide])
>
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
>
> In [27]: myarray = le.transform([[c for c in myguide], [c for c in myguide], [c for c in myguide]])
>
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])
>
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.]])   <— ????
>
> So this is not at all what I expected. I read the documentation for OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), but did not find it clear how it works (I also found the example using integers confusing). Neither FeatureHasher nor DictVectorizer seems more appropriate for transforming strings into positional one-hot encoded arrays. Am I missing something, or is this operation not supported in sklearn?
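>
> For comparison, the pandas version I had in mind is roughly this (just a sketch):
>
> >>> import pandas as pd
> >>> pd.get_dummies([c for c in myguide])
>
> which returns a DataFrame with one indicator column per letter, i.e. the [[1,0,0,0], ...] pattern above.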
>
> Thanks,
>
> --
> Lee Zamparo