Adding BM25 to scikit-learn.feature_extraction.text
Hi everyone,

To put it succinctly, here's the BM25 equation:

    f(w,D) * (k+1) / (k*B + f(w,D))

where w is the word and D is the document (corresponding to rows and columns, respectively). f is a sparse matrix because only a small fraction of the vocabulary appears in any single document.

B is a function of the document only, but that doesn't matter here; you can think of it as a constant if you want.

The problem is that, since f(w,D) is almost always zero, I only need to do the calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) where f(w,D) is nonzero. Is there a clever way to do this with masks?

You can rearrange the equation above to get this:

    (k+1) / (k*B/f(w,D) + 1)

but alas, f(w,D) still appears in a denominator, which is bad (division by zero).

So currently I am converting to a coo_matrix and iterating through the nonzero values like this:

    cx = x.tocoo()
    # built-in zip on Python 3; itertools.izip on Python 2
    for i, j, v in zip(cx.row, cx.col, cx.data):
        (i, j, v)  # row index, column index, stored value

That iterator is incredibly fast, but unfortunately coo_matrix does not support assignment. So I create a new copy, as either a dok sparse matrix or a regular numpy array, and assign to that.

I could also work directly with the .data, .indptr, and .indices attributes of csr_matrix, and see whether it's possible to create a copy of the .data attribute and update the values accordingly. I was hoping somebody had encountered this type of issue before.

Sincerely,
Basil Beirouti
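The idea of copying .data from a csr_matrix and updating only the stored values could look roughly like the sketch below. It is only an illustration under some assumptions: the function name bm25_on_csr_data and the default k are made up here, B is assumed to be a strictly positive array with one entry per row, and rows are assumed to be documents (the scikit-learn convention). Because the formula maps 0 to 0, the entries that are not stored never need to be touched.

```python
import numpy as np
from scipy import sparse


def bm25_on_csr_data(X, B, k=1.5):
    """Apply f*(k+1)/(k*B + f) to the stored entries of a CSR matrix only.

    Hypothetical helper: assumes rows are documents and B[i] > 0 is the
    per-document factor for row i.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    # indptr[i+1] - indptr[i] is the number of stored entries in row i,
    # so this repeats B[i] once for every stored entry of row i.
    B_per_entry = np.repeat(np.asarray(B, dtype=np.float64), np.diff(X.indptr))
    f = X.data
    new_data = f * (k + 1) / (k * B_per_entry + f)
    # Reuse the sparsity pattern (indices, indptr); only the values change.
    return sparse.csr_matrix((new_data, X.indices, X.indptr), shape=X.shape)


counts = np.array([[2, 0, 1],
                   [0, 0, 3]])
B = np.array([1.0, 2.0])
weighted = bm25_on_csr_data(counts, B, k=1.0)
print(weighted.toarray())
```

No Python-level loop over the nonzeros is needed, and the result keeps the same sparsity pattern as the input. If documents were stored as columns instead, the same approach should carry over to a csc_matrix.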
Hi Basil,

If B were just a constant, you could do the whole thing as a vectorized operation on X.data. Since, as I understand it, B is a vector of length n_samples, I think the cleanest way to compute the denominator is with sklearn.utils.sparsefuncs.inplace_row_scale.

Hope this helps,
Vlad

(In reply to Basil Beirouti's message of July 1, 2016, 5:17:43 PM EDT.)
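One possible reading of the inplace_row_scale suggestion is sketched below; it is an interpretation, not necessarily what Vlad had in mind. Dividing the numerator and denominator of the BM25 expression by B gives g*(k+1)/(k + g) with g = f/B, so scaling each row by 1/B first reduces the rest to the constant-B case, which is a plain vectorized operation on X.data. The function name bm25_row_scaled and the default k are hypothetical, and B is assumed to be strictly positive with one entry per row.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import inplace_row_scale


def bm25_row_scaled(X, B, k=1.5):
    """BM25-style weighting via row scaling (hypothetical helper).

    Uses f*(k+1)/(k*B + f) == g*(k+1)/(k + g) with g = f/B, B > 0.
    """
    # Work on a float copy so the caller's matrix is left untouched
    # and the in-place scaling does not hit an integer dtype.
    X = sparse.csr_matrix(X, dtype=np.float64, copy=True)
    # Row i is multiplied by 1/B[i]; only stored entries are touched.
    inplace_row_scale(X, 1.0 / np.asarray(B, dtype=np.float64))
    g = X.data
    # After scaling there is no per-row term left, so this is the
    # "constant B" case: one vectorized expression on the stored values.
    X.data = g * (k + 1) / (k + g)
    return X


counts = np.array([[2, 0, 1],
                   [0, 0, 3]])
B = np.array([1.0, 2.0])
weighted = bm25_row_scaled(counts, B, k=1.0)
print(weighted.toarray())
```

Zero entries stay zero without any masking, since they are never stored in the first place.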
Participants (2): Basil Beirouti, Vlad Niculae