[scikit-learn] Adding BM25 relevance function to sklearn.feature_extraction.text

Mon Jun 13 22:44:36 EDT 2016

Hello all,

You can use sklearn.feature_extraction.text.TfidfVectorizer to fit a
corpus of documents and then rank those documents by relevance to a new,
previously unseen query.
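
For illustration, a minimal sketch of that workflow might look like the
following. The toy corpus and query are made up for this example, and
cosine similarity (via linear_kernel on the L2-normalized TF-IDF vectors)
is just one reasonable choice of relevance score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

# Learn the vocabulary and IDF weights, and vectorize the corpus.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

# Vectorize an unseen query with the same vocabulary and weights.
query_vec = vectorizer.transform(["cat on a mat"])

# TF-IDF rows are L2-normalized by default, so the dot product
# equals the cosine similarity between query and document.
scores = linear_kernel(query_vec, doc_matrix).ravel()

# Document indices, most relevant first.
ranking = scores.argsort()[::-1]
```

Here ranking holds the document indices sorted from most to least
relevant to the query.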

BM25 works in a similar manner to TF-IDF weighting, but it additionally
saturates term frequency and normalizes for document length. It is widely
considered one of the most successful ranking functions in information
retrieval.

I currently have code that implements BM25 quite efficiently to learn a
corpus of documents and I want to modify/port it to align with the
fit-transform framework of sklearn. I think it could fit neatly into the
current codebase.
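
To make the idea concrete, here is a rough sketch of what a
fit/transform-style BM25 could look like. It is only an illustration
under stated assumptions, not the implementation mentioned above: the
class name BM25Transformer, the parameters k1 and b with their defaults,
and the attributes idf_ and avgdl_ are all hypothetical and do not exist
in scikit-learn. It assumes a term-count matrix as input (e.g. from
CountVectorizer) and uses one common smoothed IDF variant that stays
positive.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer


class BM25Transformer(TransformerMixin, BaseEstimator):
    """Hypothetical BM25 weighting of a term-count matrix (not in scikit-learn)."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # strength of document-length normalization (assumed default)

    def fit(self, X, y=None):
        X = sp.csr_matrix(X)
        n_docs = X.shape[0]
        # Document frequency: number of documents containing each term.
        df = np.bincount(X.indices, minlength=X.shape[1])
        # Smoothed IDF variant that is always positive.
        self.idf_ = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Average document length, measured in term counts.
        self.avgdl_ = X.sum(axis=1).mean()
        return self

    def transform(self, X):
        X = sp.csr_matrix(X, dtype=np.float64, copy=True)
        doc_len = np.asarray(X.sum(axis=1)).ravel()
        # Per-document length normalization, repeated for every stored entry.
        norm = self.k1 * (1.0 - self.b + self.b * doc_len / self.avgdl_)
        norm_per_entry = np.repeat(norm, np.diff(X.indptr))
        tf = X.data
        X.data = self.idf_[X.indices] * tf * (self.k1 + 1.0) / (tf + norm_per_entry)
        return X


# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

counts = CountVectorizer()
X = counts.fit_transform(corpus)
bm25 = BM25Transformer().fit(X)
doc_weights = bm25.transform(X)

# Score a query: each distinct query term contributes its BM25 weight once.
query = counts.transform(["cat on a mat"])
query.data[:] = 1
scores = doc_weights.dot(query.T).toarray().ravel()
ranking = np.argsort(scores)[::-1]
```

Splitting the counting step (CountVectorizer) from the weighting step
mirrors how TfidfTransformer relates to TfidfVectorizer, which may be one
way to slot BM25 into the existing design.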

Questions:
1.) Would this be a desirable feature?
2.) Any advice for how to proceed with this? Things to watch out for?

Any and all advice is welcome.

Thanks!
Basil