[scikit-learn] MultiLabelBinarizer gives individual characters instead of the classes

Thu Sep 12 01:24:48 EDT 2019

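I think this caveat has been added to the dev docs (it is not yet in the
stable docs). You may want to read:
https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
and in particular the part that starts with "A common mistake is to pass
in a list". In short, MultiLabelBinarizer expects an iterable of iterables
of labels. If each sample is a single string, that string is iterated
character by character, so the individual characters become the classes.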
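
A minimal sketch of what is probably happening and one possible fix. It
assumes each label cell was loaded from the CSV as a string that only looks
like a Python list. The screenshots are not available, so the `raw_labels`
values below are hypothetical, and `ast.literal_eval` is just one way to
parse them:

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: each label cell is a *string*, not a list of labels.
raw_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Fitting on strings: every string is iterated character by character,
# so the learned classes are single characters.
char_classes = MultiLabelBinarizer().fit(raw_labels).classes_

# Parse each string into a real list of labels before fitting.
parsed_labels = [ast.literal_eval(s) for s in raw_labels]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(parsed_labels)

print(list(mlb.classes_))
print(encoded)
```

With a pandas column, the same parsing step could be applied with
`arxiv_data['labels'].apply(ast.literal_eval)`, provided the strings still
contain the quotes and brackets of a valid Python list.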

Cheers,
Loïc

> Hi.
>
> I am working on a multi-label text classification problem. To encode the labels, I am using MultiLabelBinarizer. The labels of the dataset look like this:
>
> [image]
>
> When I use
>
> mlb = MultiLabelBinarizer()
> mlb.fit(labels)
> print(mlb.classes_)
>
> I get:
>
> [image]
>
> whereas the (sample) output I want is:
>
> [image]
>
> I got the above output with:
>
> mlb = MultiLabelBinarizer()
> sample_labels = [
>     ['stat.ML', 'cs.LG'],
>     ['cs.CV', 'cs.RO']
> ]
> mlb.fit(sample_labels)
> print(mlb.classes_)
>
> Help would be very much appreciated here.
>
> Here's the dataset I had prepared:
> arXivdata.csv.zip
>
> I stripped away the double quotes in the labels after loading it in a pandas DataFrame by -
>
> # Remove literal double quotes from every label string
> arxiv_data['labels'] = arxiv_data['labels'].str.replace('"', '', regex=False)
>
> scikit-learn version: '0.21.3'
>
> Sayak Paul | sayak.dev