Hello - I was just wondering if there is a way to improve performance on the one-hot encoder, or whether there are any plans to do so in the future. I am working with a matrix that will ultimately have 20 million categorical variables, and my bottleneck is the one-hot encoder. Let me know if this isn't the place to inquire. My code is very simple when using the encoder, but I cut and pasted it here for completeness.

enc = OneHotEncoder(sparse=True)
Xtrain = enc.fit_transform(tiledata)

Thanks,
Sarah
20 million categories, or 20 million categorical variables? OneHotEncoder is pretty efficient if you specify n_values.

On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
You will also benefit from assume_finite (see http://scikit-learn.org/stable/modules/generated/sklearn.config_context.html)
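For readers of the archive, a minimal sketch of the assume_finite suggestion (using only the public config_context API from the linked docs; the encoder call is a placeholder from the earlier message):

```python
import sklearn

# Inside this context, scikit-learn skips its validation pass that checks
# inputs for NaN/inf, saving a full scan over a very large matrix.
with sklearn.config_context(assume_finite=True):
    assert sklearn.get_config()["assume_finite"] is True
    # ... call enc.fit_transform(tiledata) here ...

# Outside the context, the default (False) is restored automatically.
assert sklearn.get_config()["assume_finite"] is False
```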
Sorry - your second message popped up when I was writing my response. I will look at this as well. Thanks for being so speedy!

Cheers,
Sarah

On Sun, Feb 4, 2018 at 11:30 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
Hi Joel -

20 million categorical variables. It comes from segmenting the genome into 20 million parts. Genomes are big :)

For n_values, I am a bit confused: is the input the same as the output for n_values? Originally, I thought it was just the number of levels per column, but it seems to be more like the highest value of the levels (in terms of integers).

Cheers,
Sarah

On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
If each input column is encoded as a value from 0 to (number of possible values for that column - 1), then n_values for that column should be the highest value + 1, which is also the number of levels per column. Does that make sense?

Actually, I've realised there's a somewhat slow and unnecessary bit of code in the one-hot encoder: where the COO matrix is converted to CSR. I suspect this was done because most of our ML algorithms perform better on CSR, or else to maintain backwards compatibility with an earlier implementation.
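To make the "highest value + 1" rule concrete, here is a small sketch in plain NumPy (using the toy matrix that appears later in this thread) of deriving n_values directly from integer-coded data:

```python
import numpy as np

X = np.array([[7, 0, 3],
              [1, 2, 0],
              [0, 2, 1],
              [1, 0, 2]])

# n_values per column is the highest code + 1 (covering the 0..max range),
# not the count of distinct codes actually observed in the data.
n_values = X.max(axis=0) + 1
print(n_values)          # [8 3 4]
print(int(n_values.sum()))  # 15 output columns after one-hot encoding
```

Note the first column has only three distinct codes (0, 1, 7) but needs n_values = 8, which is exactly why unused codes show up as all-zero columns in the encoded output.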
Hi Joel - Conceptually, that makes sense. But when I assign n_values, I can't make it match the result when you don't specify them. See below. I used the number of unique levels per column.
>>> enc = OneHotEncoder(sparse=False)
>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>> test
array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>> test
array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
       [0., 1., 0., 0., 0., 2., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
Cheers,
Sarah

On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.nothman@gmail.com> wrote:
If I use the n+1 approach, then I get the correct matrix, except with the columns of zeros:
>>> test
array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer it to how you want.

On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Great. Thank you for all your help.

Cheers,
Sarah

On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman@gmail.com> wrote:
Hi Joel -

I am also seeing a huge overhead in memory when calling the one-hot encoder. I have worked around it by splitting my matrix into 4-5 smaller matrices (by columns) and then concatenating the results. But I am still seeing upwards of 100 gigs of overhead. Should I file a bug report, or is this to be expected?

Cheers,
Sarah

On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
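The column-chunking workaround described above can be sketched without scikit-learn at all (a hypothetical helper, not the library's API; codes are assumed to be non-negative integers starting at 0), building each chunk's sparse block directly so only one chunk's intermediate copies are live at a time:

```python
import numpy as np
import scipy.sparse as sp

def onehot_chunked(X, n_chunks=4):
    """One-hot encode integer codes one column-chunk at a time, then
    hstack the sparse pieces (sketch of the workaround, not sklearn's API)."""
    pieces = []
    for chunk in np.array_split(np.arange(X.shape[1]), n_chunks):
        sub = X[:, chunk]
        n_values = sub.max(axis=0) + 1                 # codes run 0..max per column
        offsets = np.concatenate(([0], np.cumsum(n_values[:-1])))
        cols = (sub + offsets).ravel()                 # output column of each 1
        rows = np.repeat(np.arange(sub.shape[0]), sub.shape[1])
        data = np.ones(sub.size, dtype=np.float32)
        pieces.append(sp.csr_matrix((data, (rows, cols)),
                                    shape=(sub.shape[0], int(n_values.sum()))))
    return sp.hstack(pieces, format="csr")

# Tiny demo: two columns, two rows.
M = onehot_chunked(np.array([[0, 1], [1, 0]]), n_chunks=2)
print(M.toarray())
```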
OneHotEncoder will not magically reduce the size of your input. It will necessarily increase the memory of the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive than the input.

On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Yes, of course. What I mean is that I start out with 19 gigs (initial matrix size) or so, it balloons to 100 gigs *within the encoder function*, and it returns 28 gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point.

Cheers,
Sarah

On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
Yes, the output CSR representation requires:

1 (dtype) value per entry
1 int32 per entry
1 int32 per row

The intermediate COO representation requires:

1 (dtype) value per entry
2 int32 per entry

So as long as the transformation from COO to CSR is done over the whole data, it will occupy roughly 5x the input size, which is exactly what you are experiencing.

The CategoricalEncoder currently available in the development version of scikit-learn does not have this problem, but might be slower due to handling non-integer categories. It will also possibly disappear and be merged into OneHotEncoder soon (see PR #10523).

On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
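That accounting can be turned into a rough peak-memory estimator (a back-of-envelope sketch; the byte sizes are assumptions for float64 values with int32 indices, and the COO intermediate and CSR output are assumed to coexist during conversion):

```python
def coo_plus_csr_bytes(n_entries, n_rows, value_bytes=8, index_bytes=4):
    """Approximate peak bytes while a COO matrix is converted to CSR.

    COO: one value + two int32 indices (row, col) per entry.
    CSR: one value + one int32 column index per entry, plus the indptr
    array of n_rows + 1 int32s.
    """
    coo = n_entries * (value_bytes + 2 * index_bytes)
    csr = n_entries * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    return coo + csr

# One nonzero per categorical cell, e.g. 1000 rows x 20 columns:
peak = coo_plus_csr_bytes(1000 * 20, 1000)
print(peak)
```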
Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix weren't so large. Thanks again for all your help.

Cheers,
Sarah

On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
Yes, the output CSR representation requires: 1 (dtype) value per entry 1 int32 per entry 1 int32 per row
The intermediate COO representation requires: 1 (dtype) value per entry 2 int32 per entry
So as long as the transformation from COO to CSR is done over the whole data, it will occupy roughly 5x the input size, which is exactly what you are experienciong.
The CategoricalEncoder currently available in the development version of scikit-learn does not have this problem, but might be slower due to handling non-integer categories. It will also possibly disappear and be merged into OneHotEncoder soon (see PR #10523).
On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Yes, of course. What I mean is the I start out with 19 Gigs (initial matrix size) or so, it balloons to 100 Gigs *within the encoder function* and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point.
Cheers, Sarah
On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
OneHotEncoder will not magically reduce the size of your input. It will necessarily increase the memory of the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive than the input.
On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zaranek@gmail.com
wrote:
Hi Joel -
I am also seeing a huge overhead in memory for calling the onehot-encoder. I have hacked it by running it splitting by matrix into 4-5 smaller matrices (by columns) and then concatenating the results. But, I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or is this to be expected.
Cheers, Sarah
On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < sarah.zaranek@gmail.com> wrote:
Great. Thank you for all your help.
Cheers, Sarah
On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman@gmail.com> wrote:
If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer it to how you want.
On 5 February 2018 at 16:31, Sarah Wait Zaranek < sarah.zaranek@gmail.com> wrote:
> If I use the n+1 approach, then I get the correct matrix, except with
> the columns of zeros:
>
> >>> test
> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>
> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
>
>> Hi Joel -
>>
>> Conceptually, that makes sense. But when I assign n_values, I can't
>> make it match the result when you don't specify them. See below. I used
>> the number of unique levels per column.
>>
>> >>> enc = OneHotEncoder(sparse=False)
>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>> >>> test
>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>> >>> test
>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.nothman@gmail.com> wrote:
>>
>>> If each input column is encoded as a value from 0 to (the number of
>>> possible values for that column - 1), then n_values for that column
>>> should be the highest value + 1, which is also the number of levels per
>>> column. Does that make sense?
>>>
>>> Actually, I've realised there's a somewhat slow and unnecessary bit of
>>> code in the one-hot encoder: where the COO matrix is converted to CSR.
>>> I suspect this was done because most of our ML algorithms perform better
>>> on CSR, or else to maintain backwards compatibility with an earlier
>>> implementation.
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Hello, I have installed the dev version (0.20.dev0). Should I just use CategoricalEncoder, or is the functionality already rolled into OneHotEncoder? I get the following message:

  File "<stdin>", line 1, in <module>
  File "/scikit-learn/sklearn/preprocessing/data.py", line 2839, in __init__
    "CategoricalEncoder briefly existed in 0.20dev. Its functionality "
RuntimeError: CategoricalEncoder briefly existed in 0.20dev. Its functionality has been rolled into the OneHotEncoder and OrdinalEncoder. This stub will be removed in version 0.21.

Cheers, Sarah

On Mon, Feb 5, 2018 at 10:46 PM, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix wasn't so large. Thanks again for all your help.
Cheers, Sarah
On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
Yes, the output CSR representation requires:
1 (dtype) value per entry
1 int32 per entry
1 int32 per row

The intermediate COO representation requires:
1 (dtype) value per entry
2 int32 per entry

So as long as the transformation from COO to CSR is done over the whole data, it will occupy roughly 5x the input size, which is exactly what you are experiencing.
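That accounting can be checked directly with scipy; a small sketch with arbitrary sizes, where a random sparse matrix simply stands in for the one-hot output:

```python
import scipy.sparse as sp

n_rows, n_cols = 10_000, 5_000

# A random sparse matrix standing in for the one-hot encoder's output.
coo = sp.random(n_rows, n_cols, density=0.004, format="coo", random_state=0)
csr = coo.tocsr()

# COO stores 1 value + 2 indices per entry; CSR stores 1 value + 1 index
# per entry, plus one row pointer per row.
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(coo_bytes, csr_bytes)  # CSR is smaller once nnz greatly exceeds n_rows
```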
The CategoricalEncoder currently available in the development version of scikit-learn does not have this problem, but might be slower due to handling non-integer categories. It will also possibly disappear and be merged into OneHotEncoder soon (see PR #10523).
On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Yes, of course. What I mean is that I start out with 19 Gigs (initial matrix size) or so, it balloons to 100 Gigs *within the encoder function*, and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point.
Cheers, Sarah
On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.nothman@gmail.com> wrote:
OneHotEncoder will not magically reduce the size of your input. It will necessarily increase the memory of the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive than the input.
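A rough illustration of this point, with arbitrary sizes: the sparse one-hot output costs more than the compact integer input, but far less than its dense equivalent:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# 10,000 samples of 100 integer-coded categorical columns, ~10 levels each
X = rng.integers(0, 10, size=(10_000, 100)).astype(np.uint8)

out = OneHotEncoder().fit_transform(X).tocsr()

input_bytes = X.nbytes
sparse_bytes = out.data.nbytes + out.indices.nbytes + out.indptr.nbytes
dense_bytes = out.shape[0] * out.shape[1] * 8  # if stored as dense float64
print(input_bytes, sparse_bytes, dense_bytes)
```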
On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
Hi Joel -
I am also seeing a huge overhead in memory for calling the one-hot encoder. I have hacked around it by splitting my matrix into 4-5 smaller matrices (by columns) and then concatenating the results. But I am still seeing upwards of 100 Gigs of overhead. Should I file a bug report? Or is this to be expected?
Cheers, Sarah
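The column-chunking workaround described above can be sketched as follows (the function name and block count here are made up). Because each input column's categories are inferred independently, stacking the per-block encodings reproduces the single-pass result, while the temporary COO-to-CSR conversion only ever covers one block at a time:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

def encode_by_column_blocks(X, n_blocks=4):
    """One-hot encode blocks of columns separately and hstack the results."""
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    parts = [OneHotEncoder().fit_transform(X[:, cols]) for cols in blocks]
    return sp.hstack(parts, format="csr")

X = np.random.default_rng(0).integers(0, 5, size=(200, 12))
chunked = encode_by_column_blocks(X)
full = OneHotEncoder().fit_transform(X).tocsr()
print((chunked != full).nnz)  # 0: identical encodings
```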
On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < sarah.zaranek@gmail.com> wrote:
Great. Thank you for all your help.
Cheers, Sarah
On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman@gmail.com> wrote:
Hi Joel - Are you sure? I ran it, and it actually uses a bit more memory instead of less - same code, just run with a different docker container.

Max memory used by a single task: 50.41GB
vs
Max memory used by a single task: 51.15GB

Cheers, Sarah

On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek <sarah.zaranek@gmail.com> wrote:
In the developer version, yes? Looking for the new memory savings :)
On Wed, Aug 1, 2018, 17:29 Joel Nothman <joel.nothman@gmail.com> wrote:
Use OneHotEncoder
Hi Sarah, I have some reflection questions. You don't need to answer all of them :) How many categories (approximately) do you have in each of those 20M categorical variables? How many samples do you have? Maybe you should consider different encoding strategies, such as binary encoding. Also, this looks like a big data problem. Have you considered using distributed computing? And do you really need to use all of those 20M variables in your first approach? Consider using feature selection techniques. I would suggest that you start with something simpler, with fewer features, that runs more easily on your machine. Then later you can start adding more complexity if necessary. Keep in mind that if the number of samples is lower than the number of columns after one-hot encoding, you might face overfitting. Try to always have fewer columns than samples.

On Aug 2, 2018 12:53, "Sarah Wait Zaranek" <sarah.zaranek@gmail.com> wrote:
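On the binary-encoding suggestion: scikit-learn itself does not ship a binary encoder (the third-party category_encoders package provides one), but the idea is simple enough to sketch in numpy for a single integer-coded column, where each category id is spelled out in its bits:

```python
import numpy as np

def binary_encode(col):
    """Binary-encode an integer category column: category id k becomes the
    bits of k, so K levels need only ceil(log2(K)) output columns."""
    col = np.asarray(col)
    n_bits = max(1, int(np.ceil(np.log2(col.max() + 1))))
    # Bit j of each value goes to output column j (least significant first).
    return (col[:, None] >> np.arange(n_bits)) & 1

print(binary_encode([0, 1, 2, 3]))  # 4 levels -> 2 bit columns
```

As Sarah notes below, this is much more compact than one-hot, but recovering the original category from individual output columns is harder.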
Hi all -

I can't do binary encoding because I need to trace back to the exact categorical variable, and that is difficult with binary encoding, I believe. Each categorical variable has a range, but on average it is about 10 categories. I return a sparse matrix from the encoder. Regardless of the encoding strategy, the issue is the overhead of the encoding itself, not the resulting encoded matrix, so using an encoding which is slightly smaller isn't going to solve my issue as far as I am aware. We have done the tests with just the integer representation of the categorical variables, and the results are unsatisfying. If there is an encoder that isn't lossy - one where I can get my original category back and that doesn't have as large a memory requirement as the one-hot creation - I am happy to try it out.

Yes, I need this many variables, and they are all categorical. In my world, we have short and wide matrices - it is very common. Unfortunately, I need to do feature selection techniques on the encoded version of the data. I could possibly do some of the feature selection techniques in parts, but for the ones I really want to use I need the entire matrix (think very large ReliefF). I already have a working version on my machine with less data (and by the way, my "machine" is one of the biggest instances available in my region, with 400GB+ of RAM). I am eventually moving to a distributed computing solution (MLlib + Spark), but I wanted to see what I could do in scikit-learn before I went there.

Of course, I am aware of overfitting issues - we do regularization and cross-validation, etc. I just thought it was unfortunate that the thing holding my analysis back from using scikit-learn wasn't the machine learning but the encoding algorithm's memory requirements.

Cheers, Sarah

On Fri, Aug 3, 2018 at 7:52 AM, Fernando Marcos Wittmann <fernando.wittmann@gmail.com> wrote:
participants (3)
- Fernando Marcos Wittmann
- Joel Nothman
- Sarah Wait Zaranek