[scikit-learn] Bm25
Basil Beirouti
basilbeirouti at gmail.com
Fri Jul 1 18:47:40 EDT 2016
Oh yes, that's exactly what I was looking for. So how do I initialize an array with the same sparsity pattern as X? And how do I then do an element-wise division of the numerator by the denominator when both are sparse matrices? As you said, the operation should only touch the nonzero elements of the numerator.
Sent from my iPhone
> On Jul 1, 2016, at 5:36 PM, Vlad Niculae <zephyr14 at gmail.com> wrote:
>
> In the denominator you mean? It looks like you only need to add that to nonzero elements, since the others would all have a 0 in the numerator, right? So the final value would be zero there. Or am I missing something?
>
> You can initialize an array with the same sparsity pattern as X, but with its data set to k everywhere. Then use inplace_row_scale to multiply it by B, and add the result to X to get the denominator.
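
A minimal sketch of this recipe, assuming X is a CSR matrix of term counts with documents as rows and B is a dense vector with one entry per row. The names X, B, k and the toy values are illustrative only and are not taken from any real data. Since the scaled pattern matrix is a copy of X, both share the same indices and indptr, so the sum and the division can be done directly on the .data arrays:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale

k = 1.5  # illustrative value
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))
B = np.array([0.5, 2.0])  # one made-up value per row (document)

# Same sparsity pattern as X, with k in every stored position.
K = X.copy()
K.data[:] = k

# Multiply row i of K by B[i], in place.
inplace_row_scale(K, B)

# K is a copy of X, so the two .data arrays line up entry by entry.
denom = K.data + X.data

# Element-wise divide on the stored values only; zeros stay implicit.
bm25 = X.copy()
bm25.data = X.data * (k + 1) / denom
```

Working on .data keeps the result exactly as sparse as X and never materializes the dense k*B term.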
>
>> On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>> Hi Vlad,
>>
>> Thanks for the quick reply. Unfortunately, there's still the question of adding a scalar to every element of a sparse matrix, which sparse matrices do not support and which the equation does not let me avoid.
>>
>> Sincerely,
>> Basil Beirouti
>>
>>
>>> On Jul 1, 2016, at 4:36 PM, scikit-learn-request at python.org wrote:
>>>
>>> Today's Topics:
>>>
>>> 1. Adding BM25 to scikit-learn.feature_extraction.text
>>> (Basil Beirouti)
>>> 2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>>> (Vlad Niculae)
>>>
>>>
>>>
>>>
>>> Message: 1
>>> Date: Fri, 1 Jul 2016 16:17:43 -0500
>>> From: Basil Beirouti <basilbeirouti at gmail.com>
>>> To: scikit-learn at python.org
>>> Subject: [scikit-learn] Adding BM25 to
>>> scikit-learn.feature_extraction.text
>>> Message-ID:
>>> <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hi everyone,
>>>
>>> to put it succinctly, here's the BM25 equation:
>>>
>>> f(w,D) * (k+1) / (k*B + f(w,D))
>>>
>>> where w is the word, and D is the document (corresponding to rows and
>>> columns, respectively). f is a sparse matrix because only a fraction of the
>>> whole vocabulary of words appears in any given single document.
>>>
>>> B is a function of only the document, but it doesn't matter, you can think
>>> of it as a constant if you want.
>>>
>>> The problem is that, since f(w,D) is almost always zero, I only need to do the
>>> calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) when
>>> f(w,D) is not zero. Is there a clever way to do this with masks?
>>>
>>> You can refactor the above equation to get this:
>>>
>>> (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>>> denominator, which is bad (because of dividing by zero).
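
Spelling out the two cases shows why only the stored entries matter. This assumes k*B > 0, which holds for the usual positive k and B; "score" is just shorthand introduced here for the expression above:

```latex
\[
\mathrm{score}(w, D) = \frac{f(w,D)\,(k+1)}{k B + f(w,D)}
\]
\[
f(w,D) = 0 \;\Rightarrow\; \mathrm{score}(w,D) = \frac{0 \cdot (k+1)}{k B + 0} = 0,
\qquad
f(w,D) > 0 \;\Rightarrow\; \mathrm{score}(w,D) = \frac{k+1}{k B / f(w,D) + 1}.
\]
```

So the refactored form is only ever needed where f(w,D) > 0, and there the division by f(w,D) is safe.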
>>>
>>> So anyway, currently I am converting to a coo_matrix and iterating through
>>> the non-zero values like this:
>>>
>>> cx = x.tocoo()
>>> for i, j, v in zip(cx.row, cx.col, cx.data):
>>>     (i, j, v)  # row index, column index, stored value
>>>
>>>
>>> That iterator is incredibly fast, but unfortunately coo_matrix does
>>> not support assignment. So I create a new copy of either a dok sparse
>>> matrix or a regular numpy array and assign to that.
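
One possible way around the missing assignment, sketched under the assumption that rows are documents and B is a dense per-row vector (x, B and k below are toy stand-ins): coo_matrix can be built directly from (data, (row, col)), so the new values can be computed in one vectorized step and paired with the existing coordinates instead of being assigned one at a time.

```python
import numpy as np
import scipy.sparse as sp

k = 1.5  # illustrative value
x = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))
B = np.array([0.5, 2.0])  # one made-up value per row

cx = x.tocoo()

# cx.row[n] is the row of the n-th stored value, so B[cx.row] aligns with cx.data.
new_data = cx.data * (k + 1) / (k * B[cx.row] + cx.data)

# Reuse the coordinates; no per-element assignment and no dense copy.
result = sp.coo_matrix((new_data, (cx.row, cx.col)), shape=cx.shape).tocsr()
```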
>>>
>>> I could also deal directly with the .data, .indptr, and .indices
>>> attributes of csr_matrix, and see if it's possible to create a copy of
>>> the .data attribute and update the values accordingly. I was hoping
>>> somebody had encountered this type of issue before.
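
A rough sketch of that idea, again assuming rows are documents and B is a dense per-row vector (all values below are made up): np.diff(X.indptr) gives the number of stored values in each row, and np.repeat stretches B so that it lines up with X.data.

```python
import numpy as np
import scipy.sparse as sp

k = 1.5  # illustrative value
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 0.0, 0.0],   # an empty row is fine
                            [0.0, 3.0, 0.0]]))
B = np.array([0.5, 1.0, 2.0])  # one made-up value per row

# Number of stored values in each row of the CSR matrix.
nnz_per_row = np.diff(X.indptr)

# Repeat B[i] once per stored value in row i, matching the layout of X.data.
B_per_entry = np.repeat(B, nnz_per_row)

# Copy the matrix and overwrite only its stored values.
out = X.copy()
out.data = X.data * (k + 1) / (k * B_per_entry + X.data)
```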
>>>
>>> Sincerely,
>>>
>>> Basil Beirouti
>>>
>>>
>>>
>>> Message: 2
>>> Date: Fri, 01 Jul 2016 17:35:49 -0400
>>> From: Vlad Niculae <zephyr14 at gmail.com>
>>> To: Scikit-learn user and developer mailing list
>>> <scikit-learn at python.org>
>>> Subject: Re: [scikit-learn] Adding BM25 to
>>> scikit-learn.feature_extraction.text
>>> Message-ID: <D4036481-5AC4-44A6-810B-F347339557C9 at gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hi Basil,
>>>
>>> If B were just a constant, you could do the whole thing as a vectorized operation on X.data.
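
For the constant case, a minimal sketch (k, B and the matrix are toy values, not from the thread). Every stored value is transformed at once and the implicit zeros are never touched:

```python
import numpy as np
import scipy.sparse as sp

k = 1.5  # illustrative value
B = 1.0  # treated as a plain scalar here
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))

# Apply the formula to the stored (nonzero) values only.
bm25 = X.copy()
bm25.data = X.data * (k + 1) / (k * B + X.data)
```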
>>>
>>> Since I understand B is an n_samples vector, I think the cleanest way to compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale.
>>>
>>> Hope this helps,
>>>
>>> Vlad
>>>
>>>
>>>
>>> --
>>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.