[scikit-learn] Imputers and DataFrame objects
Ram Rachum
ram at rachum.com
Wed Aug 19 03:35:46 EDT 2020
I'll check it out. Thank you.
On Wed, Aug 19, 2020 at 9:46 AM Sole Galli via scikit-learn <
scikit-learn at python.org> wrote:
> Did you have a look at the package feature-engine? It has its own imputers
> and encoders that let you select the columns to transform and that return
> a DataFrame. It also has an sklearn wrapper that wraps sklearn transformers
> so that they return a DataFrame instead of a NumPy array.
>
> Cheers.
>
> Sole
>
> -------- Original Message --------
> On 18 Aug 2020, 13:56, Ram Rachum <ram at rachum.com> wrote:
>
>
>
>
> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <kevin at dataschool.io> wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>>
>
> Thank you for the detailed answers.
>
>>
>> > The task was to remove these irregularities. So for the "?" items,
>> replace them with mean, and for the "one", "two" etc. replace with a
>> numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the
>> optimal tool. If "preprocessing your data for Machine Learning" is your
>> primary task, then scikit-learn is usually the optimal tool. There is some
>> overlap between what is considered "cleaning" and "preprocessing", but I
>> mention this distinction because it can help you decide what tool to use.
>>
>
> Okay, but here's one example where it gets tricky. For a column with
> numbers written like "one", "two" and missing values "?", I had to do two
> things: Change them to numbers (1, 2), and then, instead of the missing
> values, add the most common element, or mean or whatever. When I tried to
> use LabelEncoder to do the first part, it complained about the missing
> values. I couldn't fix these missing values until the labels were changed
> to ints. So that put me in a frustrating Catch-22 situation, and all the
> while I'm thinking "It would be so much simpler to just write my own logic
> in a for-loop rather than try to get Pandas and scikit-learn working
> together."
>
> Any insights about that?
>
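One possible way out of the Catch-22 described above is to do the word-to-number step in pandas first, so that the "?" placeholders turn into real missing values before any imputation happens. This is only a minimal sketch: the column name `doors`, the sample values, and the `WORD_TO_NUMBER` mapping are all made up for illustration and are not taken from the code discussed in this thread.

```python
import pandas as pd

# Hypothetical data: numbers spelled out as words, with "?" marking missing values.
df = pd.DataFrame({"doors": ["two", "four", "?", "four", "two", "four"]})

# Hypothetical mapping; extend it to cover every word that occurs in the column.
WORD_TO_NUMBER = {"two": 2, "four": 4}

# Series.map yields NaN for any value that is not a key of the mapping,
# so "?" becomes NaN without special handling.
doors = df["doors"].map(WORD_TO_NUMBER)

# Fill the gaps with the most common value (doors.mean() would work the same way).
most_common = doors.mode().iloc[0]
df["doors"] = doors.fillna(most_common).astype(int)

print(df["doors"].tolist())
```

One caveat with this approach: a misspelled word that is missing from the mapping would also become NaN and be imputed silently, so it may be worth checking the unique values of the column first.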
>
>> > For one, I couldn't figure out how to apply SimpleImputer on just one
>> column in the DataFrame, and then get the results in the form of a
>> dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
>> input. In your case, this would be a 1-column DataFrame (such as
>> df[['col']]) rather than a Series (such as df['col']).
>>
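The difference between a Series and a 1-column DataFrame mentioned above can be seen in a short sketch. The DataFrame and its column names (`col`, `other`) are invented for illustration, and the mean strategy is just one of the strategies SimpleImputer offers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in the numeric column.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0], "other": ["a", "b", "c"]})

series_shape = df["col"].shape    # 1-dimensional Series
frame_shape = df[["col"]].shape   # 2-dimensional, 1-column DataFrame

# Pass the 1-column DataFrame, not the Series, to the imputer.
imputer = SimpleImputer(strategy="mean")
result = imputer.fit_transform(df[["col"]])

print(series_shape, frame_shape, result.shape)
```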
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
>> array. If you need the output to be a DataFrame, one option is to convert
>> the array to a pandas object and concatenate it to the original DataFrame.
>>
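The convert-and-concatenate option described above might look roughly like the following. This is a sketch under assumptions: the data, the column names, and the `col_imputed` name for the new column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in the numeric column.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0], "other": ["a", "b", "c"]})

# Impute the single column; the input must be 2-dimensional.
imputed_array = SimpleImputer(strategy="mean").fit_transform(df[["col"]])

# Wrap the array in a DataFrame, reusing the original index so the rows
# line up when the two objects are concatenated.
imputed = pd.DataFrame(imputed_array, columns=["col_imputed"], index=df.index)

# Concatenate column-wise to the original DataFrame.
result = pd.concat([df, imputed], axis=1)

print(result)
```

If the original values are not needed, assigning the array back with `df[["col"]] = imputed_array` overwrites the column in place instead of adding a new one.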
>
> Well, I did do that in the `process_column` helper function in the code I
> linked to above. But it kind of felt like... What am I using a framework
> for to begin with? Because that kind of logistics is the reason I want to
> use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.