[scikit-learn] Bm25
Vlad Niculae
zephyr14 at gmail.com
Fri Jul 1 18:36:42 EDT 2016
In the denominator you mean? It looks like you only need to add that to nonzero elements, since the others would all have a 0 in the numerator, right? So the final value would be zero there. Or am I missing something?
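A quick way to check this, sketched under assumptions that are mine rather than from the thread: B is treated as a single constant, and the matrix, k and B values below are made up. Applying the formula to the stored values only should give the same result as applying it to every entry of the dense matrix.

```python
import numpy as np
import scipy.sparse as sp

# Toy count matrix (hypothetical data); rows are assumed to be documents.
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))
k = 1.5  # illustrative value, not taken from the thread
B = 0.8  # B treated as one constant for this check

# Dense reference: the formula applied to every entry, zeros included.
dense = X.toarray()
reference = dense * (k + 1) / (k * B + dense)

# Sparse version: transform only the stored (nonzero) values.
scores = X.copy()
scores.data = scores.data * (k + 1) / (k * B + scores.data)

# Entries that are zero in X stay zero, so both results agree.
assert np.allclose(scores.toarray(), reference)
```

The zero entries never enter the computation, so no scalar has to be added to the whole sparse matrix.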
You can initialize a sparse matrix with the same sparsity pattern as X, but with its data set to k everywhere. Then use inplace_row_scale to multiply each row by the corresponding entry of B, and add the result to X to get the denominator.
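Roughly, the steps above could look like the sketch below. The function name bm25_weights, the default k and the CSR input with documents as rows are assumptions for illustration only. To keep the sketch dependent on NumPy and SciPy alone, the row scaling is written with np.repeat; for CSR input, sklearn.utils.sparsefuncs.inplace_row_scale(denom, B) should perform the same step.

```python
import numpy as np
import scipy.sparse as sp


def bm25_weights(X, B, k=1.5):
    """Sketch: f * (k + 1) / (k * B + f) on the stored entries of X.

    X is a (documents x words) count matrix, B has one value per row.
    Both the name and the default k are hypothetical.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)

    # Same sparsity pattern as X, with k in every stored position.
    denom = X.copy()
    denom.data = np.full_like(denom.data, k)

    # Scale row i by B[i]: repeat B[i] once per stored entry of row i.
    denom.data *= np.repeat(B, np.diff(denom.indptr))

    # The patterns match, so adding the data arrays gives k * B + f.
    denom.data += X.data

    out = X.copy()
    out.data = X.data * (k + 1) / denom.data
    return out
```

Adding the .data arrays directly relies on denom being a copy of X, so the stored entries line up one to one. The input matrix is left unmodified.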
On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>Hi Vlad,
>
>Thanks for the quick reply. Unfortunately there's still the question of
>adding a scalar to every element of a sparse matrix, which sparse
>matrices do not allow, and which the equation makes impossible to
>avoid.
>
>Sincerely,
>Basil Beirouti
>
>
>> On Jul 1, 2016, at 4:36 PM, scikit-learn-request at python.org wrote:
>>
>> Today's Topics:
>>
>> 1. Adding BM25 to scikit-learn.feature_extraction.text
>> (Basil Beirouti)
>> 2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>> (Vlad Niculae)
>>
>>
>>
>----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 1 Jul 2016 16:17:43 -0500
>> From: Basil Beirouti <basilbeirouti at gmail.com>
>> To: scikit-learn at python.org
>> Subject: [scikit-learn] Adding BM25 to
>> scikit-learn.feature_extraction.text
>> Message-ID: <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi everyone,
>>
>> to put it succinctly, here's the BM25 equation:
>>
>> f(w,D) * (k+1) / (k*B + f(w,D))
>>
>> where w is the word, and D is the document (corresponding to columns and
>> rows, respectively). f is a sparse matrix because only a fraction of the
>> whole vocabulary of words appears in any given single document.
>>
>> B is a function of only the document, but it doesn't matter, you can think
>> of it as a constant if you want.
>>
>> The problem is that, since f(w,D) is almost always zero, I only need to do
>> the calculation (i.e. multiply by (k+1) then divide by (k*B + f(w,D))) when
>> f(w,D) is not zero. Is there a clever way to do this with masks?
>>
>> You can refactor the above equation to get this:
>>
>> (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>> denominator, which is bad (because of dividing by zero).
>>
>> So anyway, currently I am converting to a coo_matrix and iterating through
>> the non-zero values like this:
>>
>> cx = x.tocoo()
>> # zip pairs each stored value with its row and column index
>> for i, j, v in zip(cx.row, cx.col, cx.data):
>>     (i, j, v)
>>
>>
>> That iterator is incredibly fast, but unfortunately coo_matrix does
>> not support assignment. So I create a new copy of either a dok sparse
>> matrix or a regular numpy array and assign to that.
>>
>> I could also deal directly with the .data, .indptr, and .indices
>> attributes of csr_matrix, and see if it's possible to create a copy of
>> the .data attribute and update the values accordingly. I was hoping
>> somebody had encountered this type of issue before.
>>
>> Sincerely,
>>
>> Basil Beirouti
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Fri, 01 Jul 2016 17:35:49 -0400
>> From: Vlad Niculae <zephyr14 at gmail.com>
>> To: Scikit-learn user and developer mailing list
>> <scikit-learn at python.org>
>> Subject: Re: [scikit-learn] Adding BM25 to
>> scikit-learn.feature_extraction.text
>> Message-ID: <D4036481-5AC4-44A6-810B-F347339557C9 at gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Basil,
>>
>> If B were just a constant, you could do the whole thing as a vectorized
>> operation on X.data.
>>
>> Since I understand B is an n_samples vector, I think the cleanest way to
>> compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale.
>>
>> Hope this helps,
>>
>> Vlad
>>
>>
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>> ------------------------------
>>
>>
>> End of scikit-learn Digest, Vol 4, Issue 3
>> ******************************************
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn