[scikit-learn] Bm25
Basil Beirouti
basilbeirouti at gmail.com
Fri Jul 1 18:27:41 EDT 2016
Hi Vlad,
Thanks for the quick reply. Unfortunately, that still leaves the problem of adding a scalar to every element of a sparse matrix: sparse matrices do not allow it, and I don't see how to avoid it in the equation.
Sincerely,
Basil Beirouti
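
A possible way around the scalar addition, sketched under assumptions rather than taken from the thread: the numerator f(w,D) * (k+1) is zero wherever f(w,D) is zero, so the sum k*B + f(w,D) only has to be evaluated on the stored entries. The sketch below assumes a CSR matrix with one row per document (the layout Vlad's reply implies) and a positive B per row. The function name bm25_saturate and the default k=1.2 are illustrative choices, not scikit-learn API.

```python
import numpy as np
from scipy import sparse


def bm25_saturate(X, B, k=1.2):
    """Apply f * (k + 1) / (k * B + f) to the stored entries of X only.

    X: sparse term-frequency matrix, one row per document (assumed layout).
    B: 1-D array with one positive value per row of X.
    """
    # Work on a float CSR copy so the caller's matrix is left untouched.
    X = sparse.csr_matrix(X, dtype=np.float64, copy=True)
    B = np.asarray(B, dtype=np.float64)
    # Row i owns indptr[i + 1] - indptr[i] stored entries, so repeating
    # B[i] that many times lines B up with X.data entry by entry.
    B_per_entry = np.repeat(B, np.diff(X.indptr))
    f = X.data
    # Dense arithmetic on a 1-D array; the implicit zeros are never visited.
    X.data = f * (k + 1.0) / (k * B_per_entry + f)
    return X


counts = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))
scores = bm25_saturate(counts, B=[1.0, 2.0], k=1.0)
```

The sparsity pattern is unchanged, so the result has the same number of stored entries as the input.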
> On Jul 1, 2016, at 4:36 PM, scikit-learn-request at python.org wrote:
>
> Message: 1
> Date: Fri, 1 Jul 2016 16:17:43 -0500
> From: Basil Beirouti <basilbeirouti at gmail.com>
> To: scikit-learn at python.org
> Subject: [scikit-learn] Adding BM25 to
> scikit-learn.feature_extraction.text
> Message-ID:
> <CAB4mTg8805nNdAja5cscf+pHrJyq0btC-AGzegd8Cqb95sVdHw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi everyone,
>
> To put it succinctly, here's the BM25 equation:
>
> f(w,D) * (k+1) / (k*B + f(w,D))
>
> where w is the word, and D is the document (corresponding to rows and
> columns, respectively). f is a sparse matrix because only a fraction of the
> whole vocabulary of words appears in any given single document.
>
> B is a function of the document only, but that doesn't matter here; you can
> think of it as a constant if you want.
>
> The problem is that f(w,D) is almost always zero, so I only need to do the
> calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) where
> f(w,D) is nonzero. Is there a clever way to do this with masks?
>
> You can rearrange the equation above to get
>
> (k+1) / (k*B/f(w,D) + 1)
>
> but alas, f(w,D) still appears in a denominator, which is bad (because of
> division by zero).
>
> So anyway, I am currently converting to a coo_matrix and iterating through
> the nonzero values like this:
>
> cx = x.tocoo()
> # built-in zip replaces itertools.izip on Python 3
> for i, j, v in zip(cx.row, cx.col, cx.data):
>     (i, j, v)  # process each nonzero entry here
>
>
> That iterator is incredibly fast, but unfortunately coo_matrix does
> not support assignment. So I create a new copy of either a dok sparse
> matrix or a regular numpy array and assign to that.
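
As a sketch of how the COO route might avoid element assignment altogether (an assumption, not something proposed in the thread): the row, col and data arrays can be transformed in one vectorized step and handed to a new coo_matrix. The name bm25_from_coo is hypothetical, and the code again assumes one row per document with a positive B per row.

```python
import numpy as np
from scipy import sparse


def bm25_from_coo(X, B, k=1.2):
    """Rebuild a BM25-weighted matrix from COO triplets, without assignment."""
    cx = sparse.coo_matrix(X, dtype=np.float64)
    # COO may hold repeated (row, col) pairs; merge them so each entry is
    # a complete term frequency before the nonlinear formula is applied.
    cx.sum_duplicates()
    B = np.asarray(B, dtype=np.float64)
    # B[cx.row] picks the document-level value for every stored entry.
    new_data = cx.data * (k + 1.0) / (k * B[cx.row] + cx.data)
    # Same coordinates, new values: no per-element writes are needed.
    out = sparse.coo_matrix((new_data, (cx.row, cx.col)), shape=cx.shape)
    return out.tocsr()
```

Building the result from (data, (row, col)) keeps everything sparse, so neither a dok matrix nor a dense array is needed as scratch space.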
>
> I could also deal directly with the .data, .indptr, and .indices
> attributes of csr_matrix, and see whether it's possible to copy the
> .data attribute and update the values accordingly. I was hoping
> somebody had encountered this type of issue before.
>
> Sincerely,
>
> Basil Beirouti
> ------------------------------
>
> Message: 2
> Date: Fri, 01 Jul 2016 17:35:49 -0400
> From: Vlad Niculae <zephyr14 at gmail.com>
> To: Scikit-learn user and developer mailing list
> <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Adding BM25 to
> scikit-learn.feature_extraction.text
> Message-ID: <D4036481-5AC4-44A6-810B-F347339557C9 at gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Basil,
>
> If B were just a constant, you could do the whole thing as a vectorized operation on X.data.
>
> Since, as I understand it, B is a vector of length n_samples, I think the cleanest way to compute the denominator is using sklearn.utils.sparsefuncs.inplace_row_scale.
>
> Hope this helps,
>
> Vlad
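
One way to read the row-scaling suggestion (an interpretation, not something spelled out in the thread): dividing the numerator and denominator by B moves the per-document term into a scaling of each row, leaving an elementwise expression that involves only the constant k. The symbol g is introduced here purely for illustration, and B(D) is assumed to be positive.

```latex
\frac{f(w,D)\,(k+1)}{k\,B(D) + f(w,D)}
  = \frac{g(w,D)\,(k+1)}{k + g(w,D)},
\qquad
g(w,D) = \frac{f(w,D)}{B(D)}, \quad B(D) > 0 .
```

Under that reading, g could be obtained by scaling row D of a float CSR matrix by 1/B(D), for instance with inplace_row_scale, and the remaining expression becomes a vectorized operation on X.data, since g = 0 maps to 0 and the implicit zeros need no update.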