<div dir="ltr">Gist at <a href="https://gist.github.com/jnothman/a75bac778c1eb9661017555249e50379">https://gist.github.com/jnothman/a75bac778c1eb9661017555249e50379</a></div><div class="gmail_extra"><br><div class="gmail_quote">On 18 August 2017 at 01:26, Joel Nothman <span dir="ltr">&lt;<a href="mailto:joel.nothman@gmail.com" target="_blank">joel.nothman@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I don't consider LabelBinarizer the best workaround.<div><br></div><div>Given a Pandas dataframe df, I'd use:</div><div><br></div><div>DictVectorizer().fit_transform(df.to_dict(orient='records'))</div><div><br></div><div>which will one-hot encode string features and pass numerical features through as column vectors. Or:</div><div><br></div><div><div><font face="monospace, monospace">from sklearn.feature_extraction import DictVectorizer</font></div><div><br></div><div><font face="monospace, monospace">class PandasVectorizer(DictVectorizer):</font></div><div><font face="monospace, monospace">    def fit(self, x, y=None):</font></div><div><font face="monospace, monospace">        return super(PandasVectorizer, self).fit(x.to_dict('records'))</font></div><div><font face="monospace, monospace">    def fit_transform(self, x, y=None):</font></div><div><font face="monospace, monospace">        return super(PandasVectorizer, self).fit_transform(x.to_dict('records'))</font></div><div><font face="monospace, monospace">    def transform(self, x):</font></div><div><font face="monospace, monospace">        return super(PandasVectorizer, self).transform(x.to_dict('records'))</font></div></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 18 August 2017 at 01:11, Andreas Mueller <span dir="ltr">&lt;<a href="mailto:t3kcit@gmail.com" target="_blank">t3kcit@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    Hi Georg.<br>
    Unfortunately this is not entirely trivial right now, but will be
    fixed by<br>
    <a class="m_-8253204342144352300m_3691155390186156182moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/pull/9151" target="_blank">https://github.com/scikit-lear<wbr>n/scikit-learn/pull/9151</a><br>
    and<br>
    <a class="m_-8253204342144352300m_3691155390186156182moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/pull/9012" target="_blank">https://github.com/scikit-lear<wbr>n/scikit-learn/pull/9012</a><br>
    which will be in the next release (0.20).<br>
    <br>
    LabelBinarizer is probably the best work-around for now, and
    selecting columns can be done (awkwardly)
    like in this example:
<a class="m_-8253204342144352300m_3691155390186156182moz-txt-link-freetext" href="http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py" target="_blank">http://scikit-learn.org/dev/au<wbr>to_examples/hetero_feature_uni<wbr>on.html#sphx-glr-auto-examples<wbr>-hetero-feature-union-py</a><br>
    <br>
    Best,<br>
    Andy<div><div class="m_-8253204342144352300h5"><br>
    <br>
    <div class="m_-8253204342144352300m_3691155390186156182moz-cite-prefix">On 08/17/2017 07:50 AM, Georg Heiler
      wrote:<br>
    </div>
    </div></div><blockquote type="cite"><div><div class="m_-8253204342144352300h5">
      <div dir="ltr">Hi,
        <div><br>
        </div>
        <div>How can I properly handle categorical values in
          scikit-learn?</div>
        <div><a href="https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934" target="_blank">https://stackoverflow.com/ques<wbr>tions/45727934/pandas-categori<wbr>es-new-levels?noredirect=1#<wbr>comment78424496_45727934</a> <br>
        </div>
        <div><br>
        </div>
        <div>
          <p style="margin:1em 0px 0px;padding:0px;text-align:justify;font-size:14px">Goals:</p>
          <ul style="margin:1em 2em 0px;padding:0px;list-style-position:initial;font-size:14px">
            <li style="margin:0px;padding:0px;line-height:20px">scikit-learn
              style fit/transform methods to encode labels of categorical
              features of X</li>
            <li style="margin:0px;padding:0px;line-height:20px">should
              handle unseen labels</li>
            <li style="margin:0px;padding:0px;line-height:20px">should
              be faster than running a label encoder manually for each
              fold and manually checking whether each label was already seen
              in the training data, which is what I currently do (<a href="https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934" style="margin:0px;padding:0px;color:rgb(0,136,204)" target="_blank">https://stackoverflow.com/que<wbr>stions/45727934/pandas-categor<wbr>ies-new-levels?noredirect=1#<wbr>comment78424496_45727934</a>, which
              links to <a href="https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce" style="margin:0px;padding:0px;color:rgb(0,136,204)" target="_blank">https://gist.github.com/geo<wbr>Heil/5caff5236b4850d673b2c9b07<wbr>99dc2ce</a>)</li>
            <li style="margin:0px;padding:0px;line-height:20px">only
              some columns are categorical, and only these should be
              converted</li>
          </ul>
          <div><br>
          </div>
        </div>
        <div>Regards,</div>
        <div>Georg</div>
      </div>
      <br>
      </div></div><pre>______________________________<wbr>_________________
scikit-learn mailing list
<a class="m_-8253204342144352300m_3691155390186156182moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a>
<a class="m_-8253204342144352300m_3691155390186156182moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a>
</pre>
    </blockquote>
    <br>
  </div>

<br></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
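<div dir="ltr">For readers following the thread, here is a minimal sketch of the DictVectorizer approach suggested above, checked against the goals in the original question. The frames <font face="monospace, monospace">train</font> and <font face="monospace, monospace">test</font> and their column names are made up for illustration; they do not come from the thread. It assumes string-valued categorical columns and numeric columns otherwise.</div>

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: one categorical (string) column, one numeric column.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.0, 3.0]})
# "green" never appears in the training data.
test = pd.DataFrame({"color": ["blue", "green"], "size": [4.0, 5.0]})

# sparse=False returns a dense NumPy array, which is easier to inspect.
vec = DictVectorizer(sparse=False)

# Each row becomes a dict; string values are one-hot encoded as "column=value",
# numeric values are passed through unchanged.
X_train = vec.fit_transform(train.to_dict(orient="records"))

# Values unseen during fit are ignored at transform time, so the "green" row
# gets zeros in every "color=..." column instead of raising an error.
X_test = vec.transform(test.to_dict(orient="records"))

print(vec.feature_names_)
print(X_test)
```

<div dir="ltr">Only the string column is expanded, so no separate column selection step is needed in this sketch. The per-row dict conversion may be slow on large frames, which is worth measuring before relying on it inside cross-validation.</div>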