<div dir="ltr"><div dir="ltr">Hi Ram,<div><br></div><div>These are great questions!</div><div><br></div><div>> The task was to remove these irregularities. So for the "?" items, replace them with mean, and for the "one", "two" etc. replace with a numerical value.</div><br><div>If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.</div><div><br></div><div>> she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".</div><div><br></div><div>Just for clarification, BinaryEncoder is not part of scikit-learn. Instead, it's part of the Category Encoders library, which is a related project to scikit-learn.</div><div><br></div><div>> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.</div><div><br></div><div>Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).</div><div><br></div><div>Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.</div><div><br></div><div>> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there.</div><div><br></div><div>Neither OneHotEncoder nor BinaryEncoder will help you to replace these string values with the corresponding numbers. 
Instead, I recommend using the pandas Series map method, applied to one column at a time.</div><div><br></div><div>Alternatively, if you need to do this mapping operation within scikit-learn, you could wrap the pandas functionality into a custom scikit-learn transformer using FunctionTransformer. That is a bit more complicated, though it does have the benefit that you can chain it into a Pipeline with a SimpleImputer. I would not recommend this approach unless you are already fluent with the scikit-learn API.</div><div><br></div><div>> Any insight you could give me would be useful.</div><div><br></div><div>It sounds like using pandas for the tasks you described is the optimal approach, but I'm basing that opinion purely on what I know from your email.</div><div><br></div><div>Hope that helps!</div><div><br></div><div>Kevin</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 17, 2020 at 3:54 AM Ram Rachum &lt;<a href="mailto:ram@rachum.com">ram@rachum.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hey guys,<br><br>This is a bit of a complicated question.<br><br>I was helping my friend do a task with Pandas/sklearn for her data science class. I figured it'd be a breeze, since I'm a fancy-pants Python programmer. Oh wow, it was so not.<br><br>I was trying to do things that felt simple to me, but there were so many problems, I spent 2 hours and only had a partial solution. I'm wondering whether I'm missing something.<br><br>She got a CSV with lots of data about cars. Some of the data had missing values (marked with "?"). Additionally, some columns had small numbers written as strings like "one", "two", "three", etc. There were maybe a few more issues like these.<br><br>The task was to remove these irregularities. So for the "?" items, replace them with mean, and for the "one", "two" etc. 
replace with a numerical value.<br><br>I could easily write my own logic that does that, but she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".<br><br>They gave me so, so many problems. For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe. (Either changing in-place or creating a new DataFrame.) I think I spent an hour on this problem alone. Eventually I <a href="https://www.dropbox.com/preview/Desktop/Shani/floof.py" target="_blank">found a way</a>, but it definitely felt like I was doing something wrong, like this is supposed to be simpler.<br><br>Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there. Well, I wanted to first convert them to real numbers and then use the same SimpleImputer to fix these. But I couldn't, because of the exception.<br><br>Any insight you could give me would be useful.<br><br><br>Thanks,<br>Ram.<br></div>
_______________________________________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Kevin Markham<div>Founder, Data School</div><div><a href="https://www.dataschool.io" target="_blank">https://www.dataschool.io</a></div><div><a href="https://www.youtube.com/dataschool" target="_blank">https://www.youtube.com/dataschool</a><br></div><div><a href="https://www.patreon.com/dataschool" target="_blank">https://www.patreon.com/dataschool</a><br></div></div></div></div>
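<div>A minimal sketch of the pandas-first approach described in the reply above. It assumes a made-up two-column stand-in for the cars CSV: the column names horsepower and doors, the sample values, and the word_to_number dictionary are all hypothetical and not taken from the original data.</div>
<pre>
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the cars CSV; names and values are invented.
df = pd.DataFrame({
    "horsepower": ["100", "?", "140"],
    "doors": ["two", "four", "?"],
})

# Turn the "?" markers into real missing values.
df = df.replace("?", np.nan)

# Series.map swaps number words for numbers; missing values stay missing.
word_to_number = {"two": 2, "four": 4}
df["doors"] = df["doors"].map(word_to_number)

# The remaining numeric strings need an explicit conversion.
df["horsepower"] = df["horsepower"].astype(float)

# Double brackets select a 1-column DataFrame, which is the 2-dimensional
# input SimpleImputer expects. The result is a NumPy array of shape (n, 1),
# so ravel() flattens it before it is assigned back to the column.
for col in ["horsepower", "doors"]:
    imputer = SimpleImputer(strategy="mean")
    df[col] = imputer.fit_transform(df[[col]]).ravel()

print(df)
```
</pre>
<div>Mapping before imputing matters here: once "doors" is numeric, the same mean imputation can fill its missing values, which sidesteps the encoder exception mentioned in the question.</div>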