<div dir="ltr">I don't consider LabelBinarizer the best workaround.<div><br></div><div>Given a Pandas dataframe df, I'd use:</div><div><br></div><div>DictVectorizer().fit_transform(df.to_dict(orient='records'))</div><div><br></div><div>which one-hot encodes string-valued columns and passes numerical columns through as single columns. Or, wrapped as a reusable transformer:</div><div><br></div><div><div><font face="monospace, monospace">from sklearn.feature_extraction import DictVectorizer</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">class PandasVectorizer(DictVectorizer):</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;# Each method converts the dataframe to a list of row dicts first.</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;def fit(self, x, y=None):</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return super(PandasVectorizer, self).fit(x.to_dict('records'))</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;def fit_transform(self, x, y=None):</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return super(PandasVectorizer, self).fit_transform(x.to_dict('records'))</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;def transform(self, x):</font></div><div><font face="monospace, monospace">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return super(PandasVectorizer, self).transform(x.to_dict('records'))</font></div></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 18 August 2017 at 01:11, Andreas Mueller <span dir="ltr">&lt;<a href="mailto:t3kcit@gmail.com" target="_blank">t3kcit@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
Hi Georg.<br>
Unfortunately this is not entirely trivial right now, but will be
fixed by<br>
<a class="m_3691155390186156182moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/pull/9151" target="_blank">https://github.com/scikit-<wbr>learn/scikit-learn/pull/9151</a><br>
and<br>
<a class="m_3691155390186156182moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/pull/9012" target="_blank">https://github.com/scikit-<wbr>learn/scikit-learn/pull/9012</a><br>
which will be in the next release (0.20).<br>
<br>
LabelBinarizer is probably the best work-around for now, and
selecting columns can be done (awkwardly)<br>
like in this example:
<a class="m_3691155390186156182moz-txt-link-freetext" href="http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py" target="_blank">http://scikit-learn.org/dev/<wbr>auto_examples/hetero_feature_<wbr>union.html#sphx-glr-auto-<wbr>examples-hetero-feature-union-<wbr>py</a><br>
<br>
Best,<br>
Andy<div><div class="h5"><br>
<br>
<div class="m_3691155390186156182moz-cite-prefix">On 08/17/2017 07:50 AM, Georg Heiler
wrote:<br>
</div>
</div></div><blockquote type="cite"><div><div class="h5">
<div dir="ltr">Hi,
<div><br>
</div>
<div>how can I properly handle categorical values in
scikit-learn?</div>
<div><a href="https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934" target="_blank">https://stackoverflow.com/<wbr>questions/45727934/pandas-<wbr>categories-new-levels?<wbr>noredirect=1#comment78424496_<wbr>45727934</a> <br>
</div>
<div><br>
</div>
<div>
<p style="margin:1em 0px 0px;padding:0px;text-align:justify;font-size:14px">goals</p>
<ul style="margin:1em 2em 0px;padding:0px;list-style-position:initial;font-size:14px">
<li style="margin:0px;padding:0px;line-height:20px">scikit-learn
                style fit/transform methods to encode labels of categorical
features of X</li>
<li style="margin:0px;padding:0px;line-height:20px">should
handle unseen labels</li>
<li style="margin:0px;padding:0px;line-height:20px">should
be faster than running a label encoder manually for each
                fold and manually checking whether the label was already seen
in the training data i.e. what I currently do (<a href="https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934" style="margin:0px;padding:0px;color:rgb(0,136,204)" target="_blank">https://stackoverflow.com/<wbr>questions/45727934/pandas-<wbr>categories-new-levels?<wbr>noredirect=1#comment78424496_<wbr>45727934</a> which
links to <a href="https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce" style="margin:0px;padding:0px;color:rgb(0,136,204)" target="_blank">https://gist.github.com/<wbr>geoHeil/<wbr>5caff5236b4850d673b2c9b0799dc2<wbr>ce</a>)</li>
<li style="margin:0px;padding:0px;line-height:20px">only
some columns are categorical, and only these should be
converted</li>
</ul>
<div><br>
</div>
</div>
<div>Regards,</div>
<div>Georg</div>
</div>
<br>
<br>
</div></div><pre>______________________________<wbr>_________________
scikit-learn mailing list
<a class="m_3691155390186156182moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a>
<a class="m_3691155390186156182moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a>
</pre>
</blockquote>
<br>
</div>
<br></blockquote></div><br></div>
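<div>A minimal sketch of the DictVectorizer suggestion at the top of this message, assuming hypothetical toy records in the shape that df.to_dict(orient='records') returns (the column names "color" and "size" are made up for illustration). It is meant to show string values being one-hot encoded, numeric values passing through as a single column, and a category not seen during fit being ignored at transform time:</div>
<pre>
```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows; each dict is one row, in the shape that
# df.to_dict(orient='records') would produce for a dataframe.
train_records = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.5},
]
test_records = [
    {"color": "green", "size": 3.0},  # "green" was not seen during fit
]

# sparse=False returns a dense NumPy array, which is easier to inspect.
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_records)

# Features unseen during fit are silently dropped by transform,
# so the "green" row gets zeros in both color columns.
X_test = vec.transform(test_records)

print(vec.feature_names_)
print(X_train)
print(X_test)
```
</pre>
<div>Because an unseen category maps to an all-zero one-hot block, this may cover the "should handle unseen labels" goal without per-fold bookkeeping. Note that it encodes every string-valued entry, so it does not by itself restrict the conversion to selected columns.</div>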