[scikit-learn] Adding BM25 relevance function to sklearn.feature_extraction.text

Joel Nothman joel.nothman at gmail.com
Tue Jun 14 00:00:55 EDT 2016


Hi Basil,

Scikit-learn isn't a library for information retrieval. The question is:
how useful is the BM25 feature reweighting in a machine learning context?

This has been previously discussed at
https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg11353.html.
The whole thread is worth reading. Despite the enthusiasm, it never got as
far as a pull request. The major burden is still to show that this
transformation helps for classification or clustering.

Joel


On 14 June 2016 at 12:44, Basil Beirouti <basilbeirouti at gmail.com> wrote:

> Hello all,
>
> You can use sklearn.feature_extraction.text.TfidfVectorizer to fit a
> corpus of documents and rank them by relevance to a new, previously
> unseen query.
>
> BM25 works in a similar manner to TfidfVectorizer, but is more complex and
> considered one of the most successful information retrieval algorithms.
>
> I currently have code that implements BM25 quite efficiently to learn a
> corpus of documents and I want to modify/port it to align with the
> fit-transform framework of sklearn. I think it could fit neatly into the
> current codebase.
>
> Questions:
> 1.) Would this be a desirable feature?
> 2.) Any advice for how to proceed with this? Things to watch out for?
>
> Any and all advice is welcome.
>
> Thanks!
> Basil