Imputers and DataFrame objects
Hey guys,

This is a bit of a complicated question.

I was helping my friend do a task with Pandas/sklearn for her data science class. I figured it'd be a breeze, since I'm a fancy-pants Python programmer. Oh wow, it was so not.

I was trying to do things that felt simple to me, but there were so many problems that I spent 2 hours and only had a partial solution. I'm wondering whether I'm missing something.

She got a CSV with lots of data about cars. Some of the data had missing values (marked with "?"). Additionally, some columns had small numbers written as strings like "one", "two", "three", etc. There were maybe a few more issues like these.

The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc., replace them with a numerical value.

I could easily write my own logic that does that, but she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".

They gave me so, so many problems. For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a DataFrame. (Either changing in-place or creating a new DataFrame.) I think I spent an hour on this problem alone. Eventually I found a way <https://www.dropbox.com/preview/Desktop/Shani/floof.py>, but it definitely felt like I was doing something wrong, like this is supposed to be simpler.

Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there. Well, I wanted to first convert them to real numbers and then use the same SimpleImputer to fix these. But I couldn't, because of the exception.

Any insight you could give me would be useful.

Thanks,
Ram.
Hi Ram,

These are great questions!

> The task was to remove these irregularities. So for the "?" items, replace them with mean, and for the "one", "two" etc. replace with a numerical value.

If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.

> she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".

Just for clarification, BinaryEncoder is not part of scikit-learn. Instead, it's part of the Category Encoders library, which is a related project to scikit-learn.

> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.

Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']). Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.
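As a rough illustration of the 2-dimensional input and NumPy output described above, here is a minimal sketch. The DataFrame and its column names are made up for the example (they are not from the original CSV), and writing the result back into the existing column is just one simple alternative to concatenating.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the cars CSV; "?" marks a missing value.
df = pd.DataFrame({
    "horsepower": ["100", "?", "140"],
    "doors": ["two", "four", "two"],
})

# Make the column numeric; anything unparseable (such as "?") becomes NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

imputer = SimpleImputer(strategy="mean")

# Double brackets select a 1-column DataFrame (2-D), which the imputer accepts.
imputed = imputer.fit_transform(df[["horsepower"]])  # 2-D NumPy array, one column

# Take the single column of the array and store it back in the DataFrame.
df["horsepower"] = imputed[:, 0]
```

Passing df["horsepower"] (a Series) instead of df[["horsepower"]] would be rejected by the imputer because it is 1-dimensional.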
> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there.

Neither OneHotEncoder nor BinaryEncoder will help you to replace these string values with the corresponding numbers. Instead, I recommend using the pandas Series map method. Alternatively, if you need to do this mapping operation within scikit-learn, you could wrap the pandas functionality into a custom scikit-learn transformer using FunctionTransformer. That is a bit more complicated, though it does have the benefit that you can chain it into a Pipeline with a SimpleImputer. But again, this is more complicated and is not the recommended approach unless you are already fluent with the scikit-learn API.
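For readers who do want the scikit-learn route, the FunctionTransformer idea might look roughly like the sketch below. The names WORD_TO_NUMBER and words_to_numbers, the mapping itself, and the sample data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping; extend it to cover the words in the real column.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3}

def words_to_numbers(X):
    # Apply Series.map to each column; values missing from the mapping
    # (such as "?") come out as NaN.
    return X.apply(lambda col: col.map(WORD_TO_NUMBER))

pipe = make_pipeline(
    FunctionTransformer(words_to_numbers),  # strings -> numbers / NaN
    SimpleImputer(strategy="mean"),         # NaN -> column mean
)

df = pd.DataFrame({"col": ["one", "two", "?", "three"]})
result = pipe.fit_transform(df[["col"]])  # 2-D NumPy array
```

Doing the mapping first sidesteps the Catch-22: the imputer only ever sees numbers and NaN, never the raw strings.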
> Any insight you could give me would be useful.

It sounds like using pandas for the tasks you described is the optimal approach, but I'm basing that opinion purely on what I know from your email.

Hope that helps!

Kevin

--
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
https://www.patreon.com/dataschool

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <kevin@dataschool.io> wrote:

> Hi Ram,
>
> These are great questions!

Thank you for the detailed answers.

>> The task was to remove these irregularities. So for the "?" items, replace them with mean, and for the "one", "two" etc. replace with a numerical value.
>
> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.

Okay, but here's one example where it gets tricky. For a column with numbers written like "one", "two" and missing values "?", I had to do two things: change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."

Any insights about that?

>> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>
> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>
> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.

Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.

Thanks for your help Kevin.
Hi Ram,

> For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values.

LabelEncoder is not the right tool for this task. It does map strings to integers, but it's not a tool for mapping *particular* strings to *particular* integers. More generally: LabelEncoder is a tool for encoding a label, not a tool for data cleaning (which is how I would describe your task).
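A small sketch may make the point about *particular* integers concrete. The input list is made up; the behaviour shown is that LabelEncoder assigns codes by the sorted order of the class names, not by what the words mean.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["one", "two", "three"])

# Classes are stored in sorted (alphabetical) order, and each code is the
# position of the value in that sorted list.
print(encoder.classes_.tolist())  # ['one', 'three', 'two']
print(codes.tolist())             # [0, 2, 1]
```

Here "three" ends up as 1 and "two" as 2, which is fine for encoding a target label but not for recovering the numbers the strings spell out.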
> all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."

I wouldn't describe this as a case in which "pandas and scikit-learn aren't working well together." Rather, I would describe this as a case of trying to use a scikit-learn function when what you actually need is a pandas function.

Here's a solution to your problem in two lines of pandas code:

    df['col'] = df['col'].map({'one':1, 'two':2, '?':np.nan})
    df['col'] = df['col'].fillna(df['col'].mean())

Showing you that there is a simple solution is not a critique of you. Rather, pandas and scikit-learn are complex tools with huge APIs, and it takes time to master them. And to be clear, I'm not critiquing the tools either: they are complex tools with huge APIs because they are addressing complex problems with lots of functional areas.
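To try the two-line solution end to end, a self-contained version could look like this. The sample column is invented for the example; the two pandas calls are the ones from the message above, with the imports they need.

```python
import numpy as np
import pandas as pd

# Hypothetical column with spelled-out numbers and a "?" placeholder.
df = pd.DataFrame({"col": ["one", "two", "?"]})

# Step 1: map each string to its number; "?" is mapped to NaN explicitly.
df["col"] = df["col"].map({"one": 1, "two": 2, "?": np.nan})

# Step 2: fill the NaN with the mean of the remaining values
# (Series.mean skips NaN by default).
df["col"] = df["col"].fillna(df["col"].mean())
```

If the most common element is preferred over the mean, df["col"].mode()[0] could be passed to fillna instead.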
> But it kind of felt like... What am I using a framework for to begin with?

I think you will find that pandas and scikit-learn can save you a lot of code, but it does require finding the right function or class. Learning these tools requires an investment of time, and many people have found that this investment is well worth it. However, solving your problems with custom code is always an option, and it's totally fine if that is your preferred option!

Hope that helps,

Kevin
On Tue, Aug 18, 2020 at 6:53 PM Kevin Markham <kevin@dataschool.io> wrote:

> Showing you that there is a simple solution is not a critique of you. Rather, pandas and scikit-learn are complex tools with huge APIs, and it takes time to master them. And to be clear, I'm not critiquing the tools either: they are complex tools with huge APIs because they are addressing complex problems with lots of functional areas.

I understand, that makes sense. Thank you.

Thanks for your help Kevin.
Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and that return a DataFrame. It also has an sklearn wrapper that wraps sklearn transformers so that they return a DataFrame instead of a NumPy array.

Cheers.

Sole
I'll check it out. Thank you.
participants (3)
- Kevin Markham
- Ram Rachum
- Sole Galli