[scikit-learn] Imputers and DataFrame objects
Sole Galli
solegalli at protonmail.com
Wed Aug 19 02:35:41 EDT 2020
Have you looked at the feature-engine package? It has its own imputers and encoders that let you select the columns to transform and that return a DataFrame. It also has a wrapper for sklearn transformers so that they return a DataFrame instead of a NumPy array.
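To illustrate the general idea of "select some columns, transform them, get a DataFrame back", here is a minimal sketch using only pandas and scikit-learn. It is not feature-engine's API: the helper name `impute_columns` and the toy data are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_columns(df, columns, strategy="mean"):
    """Hypothetical helper: impute only `columns` and return a DataFrame."""
    out = df.copy()
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a NumPy array; assigning it back into the
    # selected columns keeps the original index and column labels.
    out[columns] = imputer.fit_transform(out[columns])
    return out


# Toy data (an assumption for illustration): one numeric column with a
# missing value and one column that should be left untouched.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})
result = impute_columns(df, ["a"])
```

The untouched columns pass through unchanged, so the caller never handles the intermediate array.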
Cheers.
Sole
-------- Original Message --------
On 18 Aug 2020, 13:56, Ram Rachum wrote:
> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <kevin at dataschool.io> wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>>> The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc., replace them with a numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers written as words ("one", "two") and missing values marked "?", I had to do two things: convert the words to numbers (1, 2), and then replace the missing values with the most common element, the mean, or whatever. When I tried to use LabelEncoder for the first part, it complained about the missing values, and I couldn't fix the missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking, "It would be so much simpler to just write my own logic in a for-loop rather than try to get pandas and scikit-learn working together."
>
> Any insights about that?
>
>>> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
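
One possible way around the Catch-22 described in the quoted message is to do the word-to-number step in pandas first, so that "?" becomes NaN, and only then impute on a one-column DataFrame as Kevin suggests. The sketch below is an assumption-laden illustration: the column name `doors`, the toy values, and the `word_to_number` mapping are all made up, and it is not the code Ram linked to.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (an assumption): numbers written as words, "?" for missing.
df = pd.DataFrame({"doors": ["two", "four", "?", "two"]})

# Hypothetical mapping; extend it to whatever words the real data uses.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4}

# Series.map turns every value that is not a key in the dict into NaN,
# so "?" becomes a proper missing value in the same step.
df["doors"] = df["doors"].map(word_to_number)

# SimpleImputer expects 2-dimensional input, hence df[["doors"]]
# (a one-column DataFrame) rather than df["doors"] (a Series).
imputer = SimpleImputer(strategy="most_frequent")
df[["doors"]] = imputer.fit_transform(df[["doors"]])
```

One caveat with this approach: `map` silently turns any unmapped value into NaN, so a misspelled word would be imputed as if it were missing rather than raising an error.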