Hi Vlad,
Thanks for the quick reply. Unfortunately there's still the question of adding a scalar to every element in a sparse matrix, which is not allowed for sparse matrices, and which is not possible to avoid in the equation.
Sincerely,
Basil Beirouti
On Jul 1, 2016, at 4:36 PM, scikit-learn-request@python.org wrote:
Send scikit-learn mailing list submissions to scikit-learn@python.org
To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request@python.org
You can reach the person managing the list at scikit-learn-owner@python.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."
Today's Topics:
1. Adding BM25 to scikit-learn.feature_extraction.text (Basil Beirouti)
2. Re: Adding BM25 to scikit-learn.feature_extraction.text (Vlad Niculae)
----------------------------------------------------------------------
Message: 1
Date: Fri, 1 Jul 2016 16:17:43 -0500
From: Basil Beirouti <basilbeirouti@gmail.com>
To: scikit-learn@python.org
Subject: [scikit-learn] Adding BM25 to scikit-learn.feature_extraction.text
Message-ID: <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi everyone,
To put it succinctly, here's the BM25 equation:
f(w,D) * (k+1) / (k*B + f(w,D))
where w is the word, and D is the document (corresponding to rows and columns, respectively). f is a sparse matrix because only a fraction of the whole vocabulary of words appears in any given single document.
B is a function of only the document, but it doesn't matter, you can think of it as a constant if you want.
The problem is that, since f(w,D) is almost always zero, I only need to do the calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) when f(w,D) is not zero. Is there a clever way to do this with masks?
You can refactor the above equation to get this:
(k+1) / (k*B/f(w,D) + 1)
but alas we still have f(w,D) appearing in a denominator, which is bad (because of dividing by zero).
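Neither form has to be evaluated at the zero entries, because the numerator f(w,D) is zero there and so is the result. A minimal sketch of that idea follows, assuming X is a SciPy sparse matrix of counts and that B can be treated as a single positive scalar, as suggested above. The function name bm25_scalar_b and the default k=1.5 are illustrative, not from the thread.

```python
import numpy as np
import scipy.sparse as sp


def bm25_scalar_b(X, k=1.5, B=1.0):
    """Apply f*(k+1)/(k*B + f) to the stored entries of X only.

    Assumes B is a positive scalar, so the denominator is never zero
    for non-negative counts.
    """
    # Explicit copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    f = X.data  # 1-D array holding only the stored (non-zero) values
    X.data = f * (k + 1.0) / (k * B + f)
    return X
```

The implicit zeros are never visited, so no mask is needed and no division by zero can occur as long as k*B is positive.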
So anyway, currently I am converting to a coo_matrix and iterating through the non-zero values like this:
    cx = x.tocoo()
    for i, j, v in itertools.izip(cx.row, cx.col, cx.data):
        (i, j, v)
That iterator is incredibly fast, but unfortunately coo_matrix does not support assignment. So I create a new copy of either a dok sparse matrix or a regular numpy array and assign to that.
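One way around the missing assignment support, sketched here under assumptions, is to skip the Python-level loop and build a new COO matrix from the transformed value array plus the original row and column indices. The sketch assumes rows are documents and that B is a 1-D array with one positive value per row; bm25_from_coo and k=1.5 are illustrative names and values.

```python
import numpy as np
import scipy.sparse as sp


def bm25_from_coo(x, B, k=1.5):
    """Rebuild a sparse matrix from transformed COO values.

    Assumption: rows are documents and B holds one positive value per row.
    """
    cx = sp.coo_matrix(x, dtype=np.float64)
    # Look up the B value of the row that each stored entry belongs to.
    row_B = np.asarray(B, dtype=np.float64)[cx.row]
    new_data = cx.data * (k + 1.0) / (k * row_B + cx.data)
    # COO matrices cannot be assigned into, but a new one can be built
    # directly from (data, (row, col)) triplets.
    out = sp.coo_matrix((new_data, (cx.row, cx.col)), shape=cx.shape)
    return out.tocsr()
```

This avoids both the dok_matrix copy and the dense NumPy array, since the result keeps exactly the stored entries of the input.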
I could also deal directly with the .data, .indptr, and .indices attributes of csr_matrix, and see if it's possible to create a copy of the .data attribute and update the values accordingly. I was hoping somebody had encountered this type of issue before.
Sincerely,
Basil Beirouti
In the denominator, you mean? It looks like you only need to add that to nonzero elements, since the others would all have a 0 in the numerator, right? So the final value would be zero there. Or am I missing something?
You can initialize an array with the same sparsity pattern as X, but its data is k everywhere. Then use inplace_row_scale to multiply it by B, then add this to X to get the denominator.
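Vlad's suggestion for the denominator (a matrix with the sparsity pattern of X holding k everywhere, row-scaled by B, then added to X) might look roughly like the sketch below. It assumes X is convertible to CSR with rows as documents and that B is a 1-D float array with one value per row. inplace_row_scale is imported from sklearn.utils.sparsefuncs; the function name bm25_denominator and k=1.5 are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale


def bm25_denominator(X, B, k=1.5):
    """Return k*B + f(w,D), stored only where X has stored entries.

    Assumption: rows are documents and B has one value per row.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    # Same sparsity pattern as X, but every stored value is k.
    K = sp.csr_matrix(
        (np.full_like(X.data, k), X.indices.copy(), X.indptr.copy()),
        shape=X.shape,
    )
    # Multiply row i of K by B[i], in place.
    inplace_row_scale(K, np.asarray(B, dtype=np.float64))
    # Sparse + sparse addition only touches stored entries.
    return K + X
```

The result is zero wherever X is zero, which is harmless because those positions are never divided by.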
Oh yes, that's exactly what I was looking for. So how do I initialize an array with the same sparsity pattern as X? And then how do I do an element-wise divide of the numerator over the denominator, when both are sparse matrices? Like you said, it should only do this operation on the non-zero elements of the numerator.
For the first question, look up the possible ways to construct scipy.sparse.csr_matrix objects; one of them will take (data, indices, indptr). Just pass a new array for data, and take the latter two from X.
For the second question, you can just do the elementwise operation in place on the data array, since they have the same shape in this case.
You can try playing around with these operations in a notebook and benchmarking them with %timeit/%memit, to see how to best organize them. I find such exercises very rewarding.
Cheers,
Vlad
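Putting both answers together, the whole weighting can be done on the CSR data array and wrapped with the (data, indices, indptr) constructor. This is only a sketch under assumptions: rows are documents, B is a 1-D array with one positive value per row, and the names bm25_csr and k=1.5 are illustrative rather than anything proposed in the thread.

```python
import numpy as np
import scipy.sparse as sp


def bm25_csr(X, B, k=1.5):
    """Compute f*(k+1)/(k*B + f) on the stored entries of a CSR matrix.

    Assumption: rows are documents and B holds one positive value per row.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    # Row i owns indptr[i+1] - indptr[i] stored entries, so repeating B
    # by those counts gives one B value per stored entry.
    row_B = np.repeat(np.asarray(B, dtype=np.float64), np.diff(X.indptr))
    numerator = X.data * (k + 1.0)   # new array, X.data is not modified
    denominator = k * row_B + X.data
    numerator /= denominator         # elementwise divide, in place
    # Reuse the index arrays of X so the sparsity pattern is identical.
    return sp.csr_matrix((numerator, X.indices, X.indptr), shape=X.shape)
```

Because the numerator and denominator arrays are aligned with the same indices and indptr, the division is a plain NumPy operation and never involves an implicit zero.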
participants (2)
- Basil Beirouti
- Vlad Niculae