[scikit-learn] Bm25

Basil Beirouti basilbeirouti at gmail.com
Fri Jul 1 18:27:41 EDT 2016


Hi Vlad,

Thanks for the quick reply. Unfortunately, that still leaves the question of adding a scalar to every element of a sparse matrix, which sparse matrices do not allow and which the equation does not let me avoid.
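
A minimal sketch of that restriction, and of one way around it. Every entry where f(w,D) is zero also has a zero result (as long as k*B is not zero), so the addition only has to touch the stored nonzero values. The matrix and the values of k and B below are made up for illustration, and B is treated as a single constant here:

```python
import numpy as np
import scipy.sparse as sp

# Toy term-count matrix (illustrative values only).
X = sp.csr_matrix(np.array([[2.0, 0.0, 1.0],
                            [0.0, 3.0, 0.0]]))

# Adding a nonzero scalar to the whole matrix would turn every implicit
# zero into a stored value, so SciPy refuses to do it.
try:
    X + 1.0
    scalar_add_allowed = True
except NotImplementedError:
    scalar_add_allowed = False

# Work on the stored values instead. Entries with f == 0 stay at 0
# because the numerator f * (k + 1) is 0 there.
k, B = 1.2, 0.75  # assumed constants, for illustration only
Y = X.copy()
Y.data = Y.data * (k + 1.0) / (k * B + Y.data)
```

The result Y has the same sparsity pattern as X, so nothing is ever densified.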

Sincerely,
Basil Beirouti 


> On Jul 1, 2016, at 4:36 PM, scikit-learn-request at python.org wrote:
> 
> Send scikit-learn mailing list submissions to
>    scikit-learn at python.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>    https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>    scikit-learn-request at python.org
> 
> You can reach the person managing the list at
>    scikit-learn-owner at python.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
> 
> 
> Today's Topics:
> 
>   1. Adding BM25 to scikit-learn.feature_extraction.text
>      (Basil Beirouti)
>   2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>      (Vlad Niculae)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 1 Jul 2016 16:17:43 -0500
> From: Basil Beirouti <basilbeirouti at gmail.com>
> To: scikit-learn at python.org
> Subject: [scikit-learn] Adding BM25 to
>    scikit-learn.feature_extraction.text
> Message-ID:
>    <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi everyone,
> 
> To put it succinctly, here's the BM25 equation:
> 
> f(w,D) * (k+1) / (k*B + f(w,D))
> 
> where w is the word, and D is the document (corresponding to rows and
> columns, respectively). f is a sparse matrix because only a fraction of the
> whole vocabulary of words appears in any given single document.
> 
> B is a function of only the document, but it doesn't matter, you can think
> of it as a constant if you want.
> 
> The problem is that, since f(w,D) is almost always zero, I only need to do the
> calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) when
> f(w,D) is not zero. Is there a clever way to do this with masks?
> 
> You can refactor the above equation to get this:
> 
> (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
> denominator, which is bad (because of dividing by zero).
> 
> So anyway, currently I am converting to a coo_matrix and iterating through
> the non-zero values like this:
> 
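>    cx = x.tocoo()  # x is a scipy.sparse matrix
>    # the built-in zip replaces itertools.izip, which exists only in Python 2
>    for i, j, v in zip(cx.row, cx.col, cx.data):
>        print(i, j, v)  # row index, column index, stored value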
> 
> 
> That iterator is incredibly fast, but unfortunately coo_matrix does
> not support assignment. So I create a new copy of either a dok sparse
> matrix or a regular numpy array and assign to that.
> 
> I could also deal directly with the .data, .indptr, and .indices
> attributes of csr_matrix, and see if it's possible to create a copy of
> the .data attribute and update the values accordingly. I was hoping
> somebody had encountered this type of issue before.
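
One way the .data idea could look, as a sketch rather than a tested implementation. It assumes the scikit-learn layout (rows are documents, columns are words) and takes B as an array with one value per row. The function name bm25_on_data and the default k are made up for illustration:

```python
import numpy as np
import scipy.sparse as sp

def bm25_on_data(X, B, k=1.2):
    """Apply f * (k + 1) / (k * B + f) to the stored entries only.

    Hypothetical helper: X holds term counts with one document per row,
    B holds one value per document (row).
    """
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.eliminate_zeros()  # drop explicitly stored zeros, if any
    # np.diff(indptr) is the number of stored entries in each row, so
    # np.repeat lines up one B value with every element of X.data.
    B_per_entry = np.repeat(np.asarray(B, dtype=np.float64),
                            np.diff(X.indptr))
    X.data = X.data * (k + 1.0) / (k * B_per_entry + X.data)
    return X

counts = np.array([[3, 0, 1],
                   [0, 2, 0]])
weights = bm25_on_data(counts, B=[0.8, 1.2])
```

Because only .data is rewritten, .indices and .indptr are reused unchanged and no Python-level loop over the nonzero entries is needed.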
> 
> Sincerely,
> 
> Basil Beirouti
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 01 Jul 2016 17:35:49 -0400
> From: Vlad Niculae <zephyr14 at gmail.com>
> To: Scikit-learn user and developer mailing list
>    <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Adding BM25 to
>    scikit-learn.feature_extraction.text
> Message-ID: <D4036481-5AC4-44A6-810B-F347339557C9 at gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Basil,
> 
> If B were just a constant, you could do the whole thing as a vectorized operation on X.data.
> 
> Since I understand B is an n_samples vector, I think the cleanest way to compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale.
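
A sketch of how that suggestion might be combined with the refactored form (k+1)/(k*B/f(w,D) + 1), using only operations on the stored values. It assumes X is a CSR matrix with one document per row and B has one entry per row. The function name bm25_row_scale and the default k are invented for illustration:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_row_scale

def bm25_row_scale(X, B, k=1.2):
    """Compute (k + 1) / (k * B / f + 1) on the stored entries of X.

    Hypothetical helper: rows are documents, B has one value per row.
    """
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    X.eliminate_zeros()                    # avoid 1 / 0 on stored zeros
    np.reciprocal(X.data, out=X.data)      # 1 / f
    inplace_row_scale(X, k * np.asarray(B, dtype=np.float64))  # k*B / f
    X.data += 1.0                          # k*B / f + 1
    np.reciprocal(X.data, out=X.data)      # 1 / (k*B / f + 1)
    X.data *= k + 1.0                      # (k + 1) / (k*B / f + 1)
    return X

counts = np.array([[3, 0, 1],
                   [0, 2, 0]])
weights = bm25_row_scale(counts, B=[0.8, 1.2])
```

The division by f(w,D) is safe here because it is applied only to entries that are stored and nonzero, and the "+ 1" is applied to X.data rather than to the matrix as a whole.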
> 
> Hope this helps,
> 
> Vlad
> 
> 
> 
> 
> ------------------------------
> 
> End of scikit-learn Digest, Vol 4, Issue 3
> ******************************************

