[scikit-learn] Adding BM25 relevance function

Basil Beirouti basilbeirouti at gmail.com
Tue Jun 14 11:47:35 EDT 2016


Hi Joel,

Thanks for your response and for digging up that archived thread; it gives
me a lot of clarity.

I see your point about BM25, but I think that in most cases where TFIDF
makes sense, BM25 makes sense as well, although it could be "overkill".

Consider that TFIDF does not produce normalized results either
<http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py>.
If BM25 requires dimensionality reduction (e.g. using LSA), so too would
TFIDF. The term-document matrix is the same size no matter which weighting
scheme is used. The only difference is that BM25 produces better results
when the corpus is large enough that the term frequency in a document, and
the document frequency in the corpus, can vary considerably across a broad
range of values. Maybe you could even say TFIDF and BM25 are the same
equation except BM25 has a few additional hyperparameters (b and k).
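
To make the "same equation plus hyperparameters" point concrete, here is a
minimal sketch of the per-term weight under each scheme. It is an
illustration, not code from any library: the function names are made up, k1
plays the role of the "k" above, the defaults k1=1.2 and b=0.75 are just
commonly quoted values, and I assume one smoothed IDF for both schemes (the
classic BM25 formulation uses a different IDF).

```python
import math


def idf(n_docs, doc_freq):
    # Smoothed inverse document frequency. Assumption: both schemes share
    # this one variant so that only the term-frequency part differs.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_weight(tf, n_docs, doc_freq):
    # Plain TFIDF: the weight grows linearly with the raw term count.
    return tf * idf(n_docs, doc_freq)


def bm25_weight(tf, n_docs, doc_freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # b controls how strongly long documents are penalized (0 = not at all).
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of a term saturate.
    saturated_tf = tf * (k1 + 1) / (tf + k1 * length_norm)
    return saturated_tf * idf(n_docs, doc_freq)
```

With b=0 and a very large k1, saturated_tf approaches the raw tf, so
bm25_weight approaches tfidf_weight. With ordinary settings the BM25 weight
levels off as tf grows, which only matters when term counts and document
lengths actually vary a lot.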

So is the advantage that BM25 provides for large, diverse corpora worth it,
or is it marginal? Perhaps you can point me to some more examples where
TFIDF is used (preferably in a supervised setting) and I can plug in BM25 in
place of TFIDF and see how it compares. Here are some I found:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
(supervised)
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
(unsupervised)

Thank you!
Basil

PS: By the way, I'm not familiar with the delta-idf transform that Pavel
mentions in the archive you linked; I'll have to delve deeper into that. I
agree with the response to Pavel that he should be putting it in a separate
class, not adding on to the TFIDF. I think it would take me about 6-8 weeks
to adapt my code to the fit/transform model and submit a pull request.
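
For reference, a rough sketch of the shape I have in mind for a separate
class following the fit/transform convention. Everything here is
hypothetical: BM25Transformer is not an existing scikit-learn class, the
input is a plain list of per-document term-count rows rather than a sparse
matrix, the corpus is assumed to be non-empty, and the smoothed IDF is one
assumed variant.

```python
import math


class BM25Transformer:
    """Hypothetical sketch of a BM25 weighting step; not a scikit-learn class."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b

    def fit(self, X, y=None):
        # X: list of rows, one per document, each a list of raw term counts.
        n_docs = len(X)
        n_terms = len(X[0])
        # Number of documents in which each term occurs at least once.
        doc_freq = [sum(1 for row in X if row[j] > 0) for j in range(n_terms)]
        self.idf_ = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
        # Average document length, measured in total term occurrences.
        self.avg_doc_len_ = sum(sum(row) for row in X) / n_docs
        return self

    def transform(self, X):
        weighted = []
        for row in X:
            # Length normalization relative to the corpus seen during fit.
            norm = 1 - self.b + self.b * sum(row) / self.avg_doc_len_
            weighted.append([
                idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
                for tf, idf in zip(row, self.idf_)
            ])
        return weighted

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

The corpus statistics (IDF and average document length) are learned in fit
and reused in transform, so unseen documents are weighted against the
training corpus, the same way the TFIDF step behaves.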