[scikit-learn] Categorical handling
Joel Nothman
joel.nothman at gmail.com
Thu Aug 17 11:26:13 EDT 2017
I don't consider LabelBinarizer the best workaround.
Given a Pandas dataframe df, I'd use:
DictVectorizer().fit_transform(df.to_dict(orient='records'))
which one-hot encodes string-valued columns and passes numerical columns
through as single columns. Or, wrapped so that it accepts the dataframe
directly:
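from sklearn.feature_extraction import DictVectorizer

# DictVectorizer that takes a pandas DataFrame and converts it to a list of
# per-row dicts before delegating to the parent class.
class PandasVectorizer(DictVectorizer):
    def fit(self, x, y=None):
        return super(PandasVectorizer, self).fit(x.to_dict('records'))

    def fit_transform(self, x, y=None):
        return super(PandasVectorizer, self).fit_transform(
            x.to_dict('records'))

    def transform(self, x):
        return super(PandasVectorizer, self).transform(x.to_dict('records'))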
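
A rough sketch of how the DictVectorizer approach behaves on unseen labels.
The two toy frames and their column names ("color", "size") are made up for
illustration and are not from the original question. A category that appears
only at transform time is silently dropped, so its row gets all zeros in the
one-hot columns:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one string-valued column, one numerical column.
train = pd.DataFrame({"color": ["red", "blue", "red"],
                      "size": [1.0, 2.0, 3.0]})
# "green" never occurs in the training frame.
test = pd.DataFrame({"color": ["green", "blue"],
                     "size": [4.0, 5.0]})

vec = DictVectorizer(sparse=False)

# Strings become "column=value" indicator features; numbers pass through.
X_train = vec.fit_transform(train.to_dict(orient="records"))

# Features not seen during fit are ignored rather than raising an error.
X_test = vec.transform(test.to_dict(orient="records"))

print(vec.feature_names_)
print(X_train)
print(X_test)
```

Whether silently zeroing out an unseen category is acceptable depends on the
model, so treat this as a starting point rather than a complete answer.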
On 18 August 2017 at 01:11, Andreas Mueller <t3kcit at gmail.com> wrote:
> Hi Georg.
> Unfortunately this is not entirely trivial right now, but will be fixed by
> https://github.com/scikit-learn/scikit-learn/pull/9151
> and
> https://github.com/scikit-learn/scikit-learn/pull/9012
> which will be in the next release (0.20).
>
> LabelBinarizer is probably the best work-around for now, and selecting
> columns can be done (awkwardly) like in this example:
> http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
>
> Best,
> Andy
>
>
> On 08/17/2017 07:50 AM, Georg Heiler wrote:
>
> Hi,
>
> how can I properly handle categorical values in scikit-learn?
> https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>
> Goals:
>
>    - scikit-learn style fit/transform methods to encode labels of
>      categorical features of X
>    - should handle unseen labels
>    - should be faster than running a label encoder manually for each fold
>      and manually checking whether the label was already seen in the
>      training data, i.e. what I currently do (see
>      https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>      which links to
>      https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce)
>    - only some columns are categorical, and only these should be converted
>
>
> Regards,
> Georg
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn