[scikit-learn] Categorical handling

Joel Nothman joel.nothman at gmail.com
Thu Aug 17 11:26:13 EDT 2017


I don't consider LabelBinarizer the best workaround.

Given a Pandas dataframe df, I'd use:

DictVectorizer().fit_transform(df.to_dict(orient='records'))

which will one-hot encode string-valued columns and pass numerical columns
through as single column vectors. Or:

from sklearn.feature_extraction import DictVectorizer

class PandasVectorizer(DictVectorizer):
    """DictVectorizer that accepts a pandas DataFrame directly."""
    def fit(self, x, y=None):
        return super(PandasVectorizer, self).fit(x.to_dict('records'))
    def fit_transform(self, x, y=None):
        return super(PandasVectorizer, self).fit_transform(
            x.to_dict('records'))
    def transform(self, x):
        return super(PandasVectorizer, self).transform(x.to_dict('records'))
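
As a rough sketch of what the DictVectorizer approach does with a mixed
dataframe: the column names and values below are made up purely for
illustration, and sparse=False is only there to make the output easy to read.
Note that a category not seen during fit ("green" here) is silently ignored at
transform time, so its row gets all zeros in the one-hot columns.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: one string (categorical) column, one numeric column.
train = pd.DataFrame({"color": ["red", "blue", "red"],
                      "size": [1.0, 2.5, 3.0]})
# "green" never appears in the training data.
test = pd.DataFrame({"color": ["green", "blue"],
                     "size": [4.0, 0.5]})

vec = DictVectorizer(sparse=False)  # dense output, for readability only

# Each row becomes a dict such as {"color": "red", "size": 1.0}.
X_train = vec.fit_transform(train.to_dict(orient="records"))
X_test = vec.transform(test.to_dict(orient="records"))

# String values expand to "column=value" indicator features;
# numeric values keep a single column named after the dataframe column.
feature_names = list(vec.feature_names_)
print(feature_names)
print(X_test)
```

Whether silently dropping unseen levels is what you want depends on the model;
it at least avoids the manual per-fold check.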


On 18 August 2017 at 01:11, Andreas Mueller <t3kcit at gmail.com> wrote:

> Hi Georg.
> Unfortunately this is not entirely trivial right now, but will be fixed by
> https://github.com/scikit-learn/scikit-learn/pull/9151
> and
> https://github.com/scikit-learn/scikit-learn/pull/9012
> which will be in the next release (0.20).
>
> LabelBinarizer is probably the best work-around for now, and selecting
> columns can be done (awkwardly)
> like in this example:
> http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
>
> Best,
> Andy
>
>
> On 08/17/2017 07:50 AM, Georg Heiler wrote:
>
> Hi,
>
> how can I properly handle categorical values in scikit-learn?
> https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>
> Goals:
>
>    - scikit-learn style fit/transform methods to encode labels of
>    categorical features of X
>    - should handle unseen labels
>    - should be faster than running a label encoder manually for each fold
>    and manually checking if the label already was seen in the training data
>    i.e. what I currently do
>    (https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>    which links to
>    https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce)
>    - only some columns are categorical, and only these should be converted
>
>
> Regards,
> Georg
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>