[scikit-learn] Categorical handling

Joel Nothman joel.nothman at gmail.com
Thu Aug 17 11:27:43 EDT 2017


gist at https://gist.github.com/jnothman/a75bac778c1eb9661017555249e50379
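
A minimal sketch of the DictVectorizer approach quoted below. The column names and values are made up for illustration, and sparse=False is used only to make the output easy to inspect. String columns are one-hot encoded, numeric columns pass through unchanged, and a category not seen during fit is silently ignored at transform time, so its one-hot columns are all zero.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one categorical (string) column, one numeric column.
train = pd.DataFrame({"color": ["red", "green", "red"],
                      "size": [1.0, 2.0, 3.0]})
# "blue" never appears in the training data.
test = pd.DataFrame({"color": ["blue", "green"],
                     "size": [4.0, 5.0]})

vec = DictVectorizer(sparse=False)

# Each row becomes a dict {column: value}; strings are expanded to
# "column=value" indicator features, numbers are kept as a single column.
X_train = vec.fit_transform(train.to_dict(orient="records"))
X_test = vec.transform(test.to_dict(orient="records"))

print(vec.feature_names_)  # ['color=green', 'color=red', 'size']
print(X_test)              # the "blue" row has zeros in both color columns
```

Whether an all-zero encoding for unseen levels is acceptable depends on the downstream estimator; this only shows that transform does not raise on them.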

On 18 August 2017 at 01:26, Joel Nothman <joel.nothman at gmail.com> wrote:

> I don't consider LabelBinarizer the best workaround.
>
> Given a Pandas dataframe df, I'd use:
>
> DictVectorizer().fit_transform(df.to_dict(orient='records'))
>
> which will handle encoding strings with one-hot and numerical features as
> column vectors. Or:
>
> from sklearn.feature_extraction import DictVectorizer
>
> class PandasVectorizer(DictVectorizer):
>     # Convert the DataFrame to a list of row dicts before delegating.
>     def fit(self, x, y=None):
>         return super(PandasVectorizer, self).fit(x.to_dict('records'))
>
>     def fit_transform(self, x, y=None):
>         return super(PandasVectorizer, self).fit_transform(
>             x.to_dict('records'))
>
>     def transform(self, x):
>         return super(PandasVectorizer, self).transform(x.to_dict('records'))
>
>
> On 18 August 2017 at 01:11, Andreas Mueller <t3kcit at gmail.com> wrote:
>
>> Hi Georg.
>> Unfortunately this is not entirely trivial right now, but will be fixed by
>> https://github.com/scikit-learn/scikit-learn/pull/9151
>> and
>> https://github.com/scikit-learn/scikit-learn/pull/9012
>> which will be in the next release (0.20).
>>
>> LabelBinarizer is probably the best work-around for now, and selecting
>> columns can be done (awkwardly)
>> like in this example:
>> http://scikit-learn.org/dev/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
>>
>> Best,
>> Andy
>>
>>
>> On 08/17/2017 07:50 AM, Georg Heiler wrote:
>>
>> Hi,
>>
>> how can I properly handle categorical values in scikit-learn?
>> https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>>
>> Goals:
>>
>>    - scikit-learn style fit/transform methods to encode labels of
>>    categorical features of X
>>    - should handle unseen labels
>>    - should be faster than running a label encoder manually for each
>>    fold and manually checking whether the label was already seen in the
>>    training data, i.e. what I currently do (see
>>    https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934
>>    which links to
>>    https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce)
>>    - only some columns are categorical, and only these should be
>>    converted
>>
>>
>> Regards,
>> Georg
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>