[scikit-learn] Adding BM25 to scikit-learn.feature_extraction.text

Vlad Niculae zephyr14 at gmail.com
Fri Jul 1 17:35:49 EDT 2016

Hi Basil,

If B were just a constant, you could do the whole thing as a vectorized operation on X.data.
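
As a rough sketch of the constant-B case (the function name, the default k=1.2, and the toy count matrix are illustrative assumptions, not anything from the proposed patch), the formula only needs to touch the stored entries of a CSR matrix:

```python
import numpy as np
import scipy.sparse as sp


def bm25_constant_b(X, k=1.2, B=1.0):
    """Apply f * (k + 1) / (k * B + f) to the stored entries only (scalar B)."""
    # Work on a float copy so the caller's count matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    f = X.data  # 1-D array holding only the non-zero term frequencies
    X.data = f * (k + 1) / (k * B + f)
    return X


# Toy term-frequency matrix, purely illustrative.
counts = sp.csr_matrix(np.array([[2, 0, 1], [0, 0, 3]]))
weighted = bm25_constant_b(counts, k=1.2, B=1.0)
```

Entries that are implicitly zero never appear in X.data, so there is no division by zero to guard against as long as k * B is positive.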

Since I understand B is a vector of length n_samples, I think the cleanest way to compute the denominator is to use sklearn.utils.sparsefuncs.inplace_row_scale.
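
One possible reading of this suggestion, sketched under assumptions: build a matrix with the same sparsity pattern as X whose stored values are all 1, scale its rows by k * B, then add the term frequencies to get the denominator. The function name bm25_row_b, the default k, and the example values of B are hypothetical. The fallback definition is only a CSR-only stand-in so the snippet still runs without scikit-learn installed.

```python
import numpy as np
import scipy.sparse as sp

try:
    from sklearn.utils.sparsefuncs import inplace_row_scale
except ImportError:
    # CSR-only stand-in: repeat each row's factor once per stored entry.
    def inplace_row_scale(X, scale):
        X.data *= np.repeat(scale, np.diff(X.indptr))


def bm25_row_b(X, B, k=1.2):
    """Apply f * (k + 1) / (k * B[i] + f) to the stored entries of row i."""
    X = sp.csr_matrix(X, dtype=np.float64, copy=True)
    B = np.asarray(B, dtype=np.float64)

    # Same sparsity pattern as X, every stored value set to 1.
    denom = X.copy()
    denom.data[:] = 1.0
    inplace_row_scale(denom, k * B)  # stored entries of row i become k * B[i]
    denom.data += X.data             # now k * B[i] + f, aligned with X.data

    X.data = X.data * (k + 1) / denom.data
    return X


# Toy term-frequency matrix and per-document B values, purely illustrative.
tf = sp.csr_matrix(np.array([[2, 0, 1], [0, 0, 3]]))
B_rows = np.array([1.0, 2.0])
scored = bm25_row_b(tf, B_rows, k=1.2)
```

Because denom is a copy of X, both share the same indptr and indices, so their data arrays line up entry by entry and the final division stays fully vectorized.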

Hope this helps,


On July 1, 2016 5:17:43 PM EDT, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>Hi everyone,
>to put it succinctly, here's the BM25 equation:
>f(w,D) * (k+1) / (k*B + f(w,D))
>where w is the word, and D is the document (corresponding to rows and
>columns, respectively). f is a sparse matrix because only a fraction of
>the whole vocabulary of words appears in any given single document.
>B is a function of only the document, but it doesn't matter, you can
>think of it as a constant if you want.
>The problem is since f(w,D) is almost always zero, I only need to do
>the calculation (i.e. multiply by (k+1) then divide by (k*B + f(w,D))) when
>f(w,D) is not zero. Is there a clever way to do this with masks?
>You can refactor the above equation to get this:
>(k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>denominator, which is bad (because of dividing by zero).
>So anyway, currently I am converting to a coo_matrix and iterating over
>the non-zero values like this:
>    cx = x.tocoo()
>    # visit only the stored (non-zero) entries as (row, column, value)
>    for i,j,v in zip(cx.row, cx.col, cx.data):
>        (i,j,v)
>That iterator is incredibly fast, but unfortunately coo_matrix does
>not support assignment. So I create a new copy of either a dok sparse
>matrix or a regular numpy array and assign to that.
>I could also deal directly with the .data, .indptr, and .indices
>attributes of csr_matrix, and see if it's possible to create a copy of
>the .data attribute and update the values accordingly. I was hoping
>somebody had encountered this type of issue before.
>Basil Beirouti
