[scikit-learn] transform categorical data to numerical representation

Joel Nothman joel.nothman at gmail.com
Sat Aug 5 18:47:23 EDT 2017


We are working on a CategoricalEncoder in
https://github.com/scikit-learn/scikit-learn/pull/9151 to make this kind of
thing easier for users. Feedback and testing are welcome.

On 6 August 2017 at 02:13, Sebastian Raschka <se.raschka at gmail.com> wrote:

> Hi, Georg,
>
> I bring this up every time here on the mailing list :), and you are probably
> aware of this issue, but it makes a difference whether your categorical
> data is nominal or ordinal. For instance, if you have an ordinal variable
> with values like {small, medium, large}, you probably want to encode it
> as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based on your
> domain knowledge regarding the variable. If you have something like {blue,
> red, green}, it may make more sense to do a one-hot encoding so that the
> classifier doesn't assume an ordering among the values, like blue >
> red > green or something like that.
>
> Now, the DictVectorizer and OneHotEncoder both do one-hot encoding.
> The LabelEncoder does convert a variable to integer values, but if you have
> something like {small, medium, large}, it wouldn't know the order (if that's
> an ordinal variable) and it would just assign integers following the sorted
> order of the labels, which is arbitrary with respect to their meaning.
> Thus, if you are dealing with ordinal variables, there's no way around
> doing this manually; for example, you could create mapping dictionaries for
> that (most conveniently done in pandas).
>
> Best,
> Sebastian
>
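To make the nominal/ordinal distinction above concrete, here is a minimal
sketch using pandas and LabelEncoder. The DataFrame, the column names and the
size_order mapping are made up for illustration; the right integer values
depend on your own domain knowledge.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): one ordinal and one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["blue", "red", "green", "blue"],
})

# Ordinal variable: spell out the order yourself with a mapping dictionary.
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)

# LabelEncoder sorts the labels, so it does not recover the intended order:
# large -> 0, medium -> 1, small -> 2.
label_codes = LabelEncoder().fit_transform(["small", "medium", "large"])
print(label_codes)  # [2 1 0]

# Nominal variable: one-hot encode so that no order is implied.
encoded = pd.get_dummies(df, columns=["color"])
print(sorted(c for c in encoded.columns if c.startswith("color_")))
# ['color_blue', 'color_green', 'color_red']
```

The manual mapping keeps small < medium < large, while the LabelEncoder
result puts large first simply because of alphabetical sorting.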
> > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.heiler at gmail.com> wrote:
> >
> > Hi,
> >
> > the LabelEncoder is only meant for a single column, i.e. the target
> > variable. Is the DictVectorizer or a manual chaining of multiple
> > LabelEncoders (one per categorical column) the desired way to get values
> > which can be fed into a subsequent classifier?
> >
> > Is there some way I have overlooked which works better and possibly also
> > can handle unseen values by applying most frequent imputation?
> >
> > regards,
> > Georg
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
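Regarding the question about one encoder per categorical column and unseen
values: the following is only a rough sketch of that idea in plain pandas,
not an existing scikit-learn API. The helper names fit_column_encoders and
transform_columns and the toy data are hypothetical, and the sketch assumes
string columns without missing values.

```python
import pandas as pd

def fit_column_encoders(train, columns):
    """Learn, per column, an integer code for each category seen in the
    training data, plus the most frequent category of that column."""
    encoders = {}
    for col in columns:
        categories = sorted(train[col].unique())
        encoders[col] = {
            "codes": {cat: i for i, cat in enumerate(categories)},
            "most_frequent": train[col].mode().iloc[0],
        }
    return encoders

def transform_columns(data, encoders):
    """Replace values not seen during fitting with the most frequent
    training category, then map every value to its integer code."""
    out = data.copy()
    for col, enc in encoders.items():
        known = out[col].isin(list(enc["codes"]))
        out[col] = out[col].where(known, enc["most_frequent"]).map(enc["codes"])
    return out

train = pd.DataFrame({
    "color": ["blue", "red", "blue", "green"],
    "shape": ["round", "square", "round", "round"],
})
test = pd.DataFrame({
    "color": ["red", "purple"],       # "purple" was never seen in train
    "shape": ["square", "triangle"],  # "triangle" was never seen in train
})

encoders = fit_column_encoders(train, ["color", "shape"])
result = transform_columns(test, encoders)
print(result)
```

Here "purple" falls back to "blue" and "triangle" to "round", the most
frequent training values. The resulting integer codes are arbitrary, so for
nominal columns they would usually still be one-hot encoded before being
passed to a classifier.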

