<div dir="ltr">I'll check it out. Thank you. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 19, 2020 at 9:46 AM Sole Galli via scikit-learn <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and return a dataframe. It also has a sklearn wrapper that wraps sklearn transformers so that they return a dataframe instead of a numpy array. <br><br>Cheers.<br><br>Sole <br><br><br>Sent from ProtonMail mobile<br><br><br><br>-------- Original Message --------<br>On 18 Aug 2020, 13:56, Ram Rachum < <a href="mailto:ram@rachum.com" target="_blank">ram@rachum.com</a>> wrote:<blockquote><br><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <<a href="mailto:kevin@dataschool.io" target="_blank">kevin@dataschool.io</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Hi Ram,<div><br></div><div>These are great questions!</div></div></div></blockquote><div><br></div><div>Thank you for the detailed answers. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div><br></div><div>> The task was to remove these irregularities. So for the "?" items, replace them with mean, and for the "one", "two" etc. replace with a numerical value.</div><br><div>If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. 
There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.</div></div></div></blockquote><div><br></div><div>Okay, but here's one example where it gets tricky. For a column with numbers written like "one", "two" and missing values "?", I had to do two things: change them to numbers (1, 2), and then replace the missing values with the most common element, or the mean, or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."</div><div><br></div><div>Any insights about that? </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.</div><div><br></div><div>Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).</div><div><br></div><div>Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.</div></div></div></blockquote><div><br></div><div>Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? 
Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.</div><div> </div><div>Thanks for your help Kevin.</div></div></div>
</blockquote></blockquote></div>