[scikit-learn] Imputers and DataFrame objects

Ram Rachum ram at rachum.com
Mon Aug 17 03:53:36 EDT 2020


Hey guys,

This is a bit of a complicated question.

I was helping my friend do a task with Pandas/sklearn for her data science
class. I figured it would be a breeze, since I'm a fancy-pants Python
programmer. Oh wow, it was so not.

I was trying to do things that felt simple to me, but there were so many
problems that after 2 hours I only had a partial solution. I'm wondering
whether I'm missing something.

She got a CSV with lots of data about cars. Some of the data had missing
values (marked with "?"). Additionally, some columns had small numbers
written as strings like "one", "two", "three", etc. There were maybe a few
more issues like these.

The task was to remove these irregularities. So for the "?" items, replace
them with the mean of their column, and for "one", "two", etc., replace
them with the corresponding numerical value.
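
To make the task concrete, here is a minimal sketch of how the "?" markers
could be turned into real missing values at load time. The CSV contents and
the column names (make, horsepower, num_doors) are invented for illustration;
they are not the actual assignment data.

```python
import io

import pandas as pd

# Hypothetical miniature of the cars CSV; column names and values are made up.
raw = io.StringIO(
    "make,horsepower,num_doors\n"
    "audi,102,four\n"
    "bmw,?,two\n"
    "volvo,114,?\n"
)

# na_values="?" tells pandas to parse "?" as NaN, so numeric columns
# come out as floats instead of strings.
cars = pd.read_csv(raw, na_values="?")

print(cars.dtypes)
print(cars.isna().sum())
```

With the "?" entries parsed as NaN, the numeric columns are ready for an
imputer, and only the word-valued columns still need converting.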

I could easily write my own logic that does that, but she told me I should
use the tools that come with sklearn: SimpleImputer, OneHotEncoder,
BinaryEncoder for the "one" "two" "three".

They gave me so, so many problems. For one, I couldn't figure out how to
apply SimpleImputer to just one column in the DataFrame and then get the
results back in the form of a DataFrame (either changed in place or as a
new DataFrame). I think I spent an hour on this problem alone. Eventually I
found a way <https://www.dropbox.com/preview/Desktop/Shani/floof.py>, but it
definitely felt like I was doing something wrong, like this is supposed to
be simpler.
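
For reference, one common pattern for the single-column case looks roughly
like the sketch below. It is not the script linked above, and the DataFrame
and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data; "horsepower" stands in for any numeric column with gaps.
df = pd.DataFrame(
    {
        "make": ["audi", "bmw", "volvo"],
        "horsepower": [102.0, np.nan, 114.0],
    }
)

imputer = SimpleImputer(strategy="mean")

# Double brackets select a one-column DataFrame (2-D), which is the shape
# SimpleImputer expects. Assigning back to the same selection writes the
# imputed values into the original DataFrame.
df[["horsepower"]] = imputer.fit_transform(df[["horsepower"]])

print(df)
```

The main trap seems to be that df["horsepower"] is a 1-D Series, while the
imputer wants 2-D input, so the double brackets do most of the work here.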

Also, when trying to use BinaryEncoder for "one" "two" "three", it raised
an exception because there were NaN values there. Well, I wanted to first
convert the words to real numbers and then use the same SimpleImputer to
fill in the missing values. But I couldn't, because of the exception.
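
One possible way around the ordering problem is sketched below. It swaps
BinaryEncoder for a plain dictionary lookup with Series.map, which is a
different technique from the one the assignment asked for. The column name
and the word list are assumptions for illustration. (As far as I know,
BinaryEncoder comes from the separate category_encoders package rather than
from scikit-learn itself.)

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column of number words with one missing entry.
df = pd.DataFrame({"num_doors": ["four", "two", np.nan]})

# Assumed vocabulary; extend it to whatever words the real data contains.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Series.map leaves NaN as NaN (and turns unknown words into NaN),
# so the missing entries survive the conversion instead of raising.
df["num_doors"] = df["num_doors"].map(word_to_number)

# Now the column is numeric, so the same mean imputation applies.
imputer = SimpleImputer(strategy="mean")
df[["num_doors"]] = imputer.fit_transform(df[["num_doors"]])

print(df)
```

Whether the mean is a sensible fill for a count like number of doors is a
separate question; strategy="most_frequent" may be a better fit.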

Any insight you could give me would be useful.


Thanks,
Ram.