[scikit-learn] BM25

Fri Jul 1 21:53:28 EDT 2016

For the first question, look up the possible ways to construct scipy.sparse.csr_matrix objects; one of them takes a (data, indices, indptr) tuple.
Just pass a new array for data, and reuse indices and indptr from X.
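
A minimal sketch of that construction, assuming X is already a CSR matrix of term counts. The toy values and the value chosen for k are made up for illustration only:

```python
import numpy as np
import scipy.sparse as sp

# Toy term-count matrix (illustrative values, not from the thread).
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))
k = 1.5  # assumed value for the BM25 constant k

# New data array: one entry per stored element of X, filled with k.
k_data = np.full(X.data.shape, k, dtype=np.float64)

# Reuse X's indices and indptr so K has exactly X's sparsity pattern.
# They are copied so that later in-place changes to K cannot affect X.
K = sp.csr_matrix((k_data, X.indices.copy(), X.indptr.copy()), shape=X.shape)
```

K.data can then be modified in place without touching X.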

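For the second question, you can just do the elementwise operation in place on the data arrays: the numerator and the denominator share the same sparsity pattern in this case, so their data arrays have the same shape and line up entry for entry.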
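
Putting the pieces from this thread together, a rough sketch of the whole computation could look like the following. It assumes X is a CSR matrix with documents as rows (the usual scikit-learn layout), that B holds one value per row, and that k is a scalar; the function name bm25_weight is hypothetical and not part of scikit-learn:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale


def bm25_weight(X, B, k=1.5):
    """Sketch of f * (k + 1) / (k * B + f), touching only stored entries.

    Assumptions: documents are rows of X, and B has one value per row.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)

    # Denominator starts as a matrix with X's sparsity pattern and data == k.
    denom = sp.csr_matrix(
        (np.full(X.data.shape, k), X.indices.copy(), X.indptr.copy()),
        shape=X.shape,
    )
    inplace_row_scale(denom, B)  # stored entries of row i become k * B[i]
    denom.data += X.data         # same pattern, so the data arrays align

    out = X.copy()
    out.data *= k + 1            # numerator: f * (k + 1)
    out.data /= denom.data       # elementwise divide on stored entries only
    return out


# Toy example (illustrative values only).
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))
B = np.array([1.0, 2.0])
W = bm25_weight(X, B, k=1.0)
```

Entries that are zero in X are never visited, so no division by zero arises as long as k * B + f is nonzero for every stored entry.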

You can try playing around with these operations in a notebook and benchmarking them with %timeit/%memit, to see how to best organize them. I find such exercises very rewarding.
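
Outside a notebook, the standard-library timeit module can stand in for %timeit. The sketch below compares a per-entry loop over a COO matrix against a vectorized operation on the data array; the function names, matrix size, density, and the values of k and B are all arbitrary choices for illustration, and the row scaling is done with np.repeat rather than inplace_row_scale to keep the example to NumPy and SciPy:

```python
import timeit

import numpy as np
import scipy.sparse as sp

# Random sparse matrix and per-row B (arbitrary illustrative sizes/values).
rng = np.random.default_rng(0)
X = sp.random(200, 500, density=0.01, format="csr", random_state=0)
B = rng.uniform(0.5, 1.5, size=X.shape[0])
k = 1.5


def bm25_loop(X, B, k):
    # Visit each stored entry and assign into a DOK matrix.
    coo = X.tocoo()
    out = sp.dok_matrix(X.shape, dtype=np.float64)
    for i, j, v in zip(coo.row, coo.col, coo.data):
        out[i, j] = v * (k + 1) / (k * B[i] + v)
    return out.tocsr()


def bm25_vectorized(X, B, k):
    # Expand B to one value per stored entry, then work on the data array.
    row_B = np.repeat(B, np.diff(X.indptr))
    data = X.data * (k + 1) / (k * row_B + X.data)
    return sp.csr_matrix((data, X.indices, X.indptr), shape=X.shape)


t_loop = timeit.timeit(lambda: bm25_loop(X, B, k), number=3)
t_vec = timeit.timeit(lambda: bm25_vectorized(X, B, k), number=3)
print("loop: %.4fs  vectorized: %.4fs" % (t_loop, t_vec))
```

The timings depend on the machine and on the matrix, so they are best measured on data shaped like the real problem.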

Cheers,
Vlad

On July 1, 2016 6:47:40 PM EDT, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>Oh yes, that's exactly what I was looking for. So how do I initialize an
>array with the same sparsity pattern as X? And then how do I do an
>elementwise divide of the numerator by the denominator, when both
>are sparse matrices? Like you said, it should only do this operation on
>the nonzero elements of the numerator.
>
>> On Jul 1, 2016, at 5:36 PM, Vlad Niculae <zephyr14 at gmail.com> wrote:
>> 
>> In the denominator you mean? It looks like you only need to add that
>> to nonzero elements, since the others would all have a 0 in the
>> numerator, right? So the final value would be zero there. Or am I
>> missing something?
>> 
>> You can initialize an array with the same sparsity pattern as X, but
>> its data is k everywhere. Then use inplace_row_scale to multiply it by
>> B, then add this to X to get the denominator.
>> 
>>> On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>>> Hi Vlad,
>>> 
>>> Thanks for the quick reply. Unfortunately there's still the question
>>> of adding a scalar to every element in a sparse matrix, which is not
>>> allowed for sparse matrices, and which is not possible to avoid in the
>>> equation.
>>> 
>>> Sincerely,
>>> Basil Beirouti 
>>> 
>>> 
>>>>  On Jul 1, 2016, at 4:36 PM, scikit-learn-request at python.org wrote:
>>>>  Message: 1
>>>>  Date: Fri, 1 Jul 2016 16:17:43 -0500
>>>>  From: Basil Beirouti <basilbeirouti at gmail.com>
>>>>  To: scikit-learn at python.org
>>>>  Subject: [scikit-learn] Adding BM25 to
>>>>     scikit-learn.feature_extraction.text
>>>>  Message-ID: <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw at mail.gmail.com>
>>>>  Content-Type: text/plain; charset="utf-8"
>>>>  
>>>>  Hi everyone,
>>>>  
>>>>  to put it succinctly, here's the BM25 equation:
>>>>  
>>>>  f(w,D) * (k+1) / (k*B + f(w,D))
>>>>  
>>>>  where w is the word, and D is the document (corresponding to rows and
>>>>  columns, respectively). f is a sparse matrix because only a fraction of the
>>>>  whole vocabulary of words appears in any given single document.
>>>>  
>>>>  B is a function of only the document, but it doesn't matter, you can think
>>>>  of it as a constant if you want.
>>>>  
>>>>  The problem is since f(w,D) is almost always zero, I only need to do the
>>>>  calculation (i.e., multiply by (k+1) then divide by (k*B + f(w,D))) when
>>>>  f(w,D) is not zero. Is there a clever way to do this with masks?
>>>>  
>>>>  You can refactor the above equation to get this:
>>>>  
>>>>  (k+1)/(k*B/f(w,D) + 1)
>>>>  
>>>>  but alas we still have f(w,D) appearing in a
>>>>  denominator, which is bad (because of dividing by zero).
>>>>  
>>>>  So anyway, currently I am converting to a coo_matrix and iterating through
>>>>  the non-zero values like this:
>>>>  
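>>>>     cx = x.tocoo()
>>>>     # visit only the stored (non-zero) entries as (row, col, value)
>>>>     for i, j, v in zip(cx.row, cx.col, cx.data):
>>>>         (i, j, v)  # placeholder for the per-entry computation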
>>>>  
>>>>  
>>>>  That iterator is incredibly fast, but unfortunately coo_matrix does
>>>>  not support assignment. So I create a new copy of either a dok sparse
>>>>  matrix or a regular numpy array and assign to that.
>>>>  
>>>>  I could also deal directly with the .data, .indptr, and .indices
>>>>  attributes of csr_matrix, and see if it's possible to create a copy of
>>>>  the .data attribute and update the values accordingly. I was hoping
>>>>  somebody had encountered this type of issue before.
>>>>  
>>>>  Sincerely,
>>>>  
>>>>  Basil Beirouti
>>>>  -------------- next part --------------
>>>>  An HTML attachment was scrubbed...
>>>>  URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20160701/8970d05a/attachment-0001.html>
>>>>  
>>>> 
>>>>  
>>>>  Message: 2
>>>>  Date: Fri, 01 Jul 2016 17:35:49 -0400
>>>>  From: Vlad Niculae <zephyr14 at gmail.com>
>>>>  To: Scikit-learn user and developer mailing list
>>>>     <scikit-learn at python.org>
>>>>  Subject: Re: [scikit-learn] Adding BM25 to
>>>>     scikit-learn.feature_extraction.text
>>>>  Message-ID: <D4036481-5AC4-44A6-810B-F347339557C9 at gmail.com>
>>>>  Content-Type: text/plain; charset="utf-8"
>>>>  
>>>>  Hi Basil,
>>>>  
>>>>  If B were just a constant, you could do the whole thing as a
>>>>  vectorized operation on X.data.
>>>>  
>>>>  Since I understand B is an n_samples vector, I think the cleanest
>>>>  way to compute the denominator is using
>>>>  sklearn.utils.sparsefuncs.inplace_row_scale.
>>>>  
>>>>  Hope this helps,
>>>>  
>>>>  Vlad
>>>>  
>>>>  
>>>>  
>>>>  -- 
>>>>  Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>>>  End of scikit-learn Digest, Vol 4, Issue 3
>>>>  ******************************************
>>> 
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.