[scikit-learn] Adding BM25 relevance function

Tue Jun 14 12:11:10 EDT 2016

Hey,

Good thing that you are trying to finish this.

Well, I looked into my old notes, and Delta tf-idf comes from the paper
"Delta TFIDF: An Improved Feature Space for Sentiment Analysis"
<http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf>. I guess
it is not very popular, and apparently it has a drawback: it does not take
into account the number of times a word occurs in each document when
calculating the distribution among classes. At least that is what I wrote
in my notes...
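
To make the drawback concrete, here is a rough sketch of the weighting as I
understand it from the paper. The function name, the add-one smoothing and
the sign convention are my own assumptions, not something taken from the
paper or from scikit-learn. Note that the class distribution is computed
from per-class document frequencies only (presence or absence of the term),
so how often the term occurs inside each document does not affect it.

```python
import numpy as np


def delta_tfidf(counts, labels, smooth=1.0):
    """Hypothetical helper: Delta TF-IDF weights for a two-class corpus.

    counts: (n_docs, n_terms) raw term counts.
    labels: 1 for positive documents, 0 for negative documents.
    smooth: additive smoothing (an assumption, to avoid division by zero).
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    pos = counts[labels == 1]
    neg = counts[labels == 0]

    # Per-class document frequencies: only presence/absence is counted,
    # so within-document term counts are ignored at this step.
    p_t = (pos > 0).sum(axis=0) + smooth
    n_t = (neg > 0).sum(axis=0) + smooth
    n_pos = pos.shape[0] + smooth
    n_neg = neg.shape[0] + smooth

    # Difference of the two class-specific idf values.
    delta_idf = np.log2(n_pos / p_t) - np.log2(n_neg / n_t)

    # The term count only enters here, as a per-document multiplier.
    return counts * delta_idf


counts = [[3, 1, 0],
          [1, 1, 0],
          [0, 1, 2],
          [0, 1, 5]]
labels = [1, 1, 0, 0]
weights = delta_tfidf(counts, labels)
```

With this toy data, the second term appears in every document of both
classes and so gets a weight of zero everywhere, while the first and third
terms get weights of opposite sign.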

As for the delta idf... If it helps, I can look into my old code, because I
no longer remember what I was referring to. I guess it is related to the
paper cited above.

Cheers,

Pavel Soriano

On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <basilbeirouti at gmail.com>
wrote:

> Hi Joel,
>
> Thanks for your response and for digging up that archived thread; it gives
> me a lot of clarity.
>
> I see your point about BM25, but I think that in most cases where TFIDF
> makes sense, BM25 makes sense as well, although it could be "overkill".
>
> Consider that TFIDF does not produce normalized results either
> <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py>.
> If BM25 requires dimensionality reduction (e.g. using LSA), so too would
> TFIDF. The term-document matrix is the same size no matter which weighting
> scheme is used. The only difference is that BM25 produces better results
> when the corpus is large enough that the term frequency in a document, and
> the document frequency in the corpus, can vary considerably across a broad
> range of values. Maybe you could even say TFIDF and BM25 are the same
> equation except BM25 has a few additional hyperparameters (b and k).
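
To illustrate the "same equation plus hyperparameters" remark, here is a
minimal sketch of one common BM25 variant. The function name, the default
values k1=1.2 and b=0.75, and the particular idf formula are assumptions
chosen for the example; this is not an existing scikit-learn API. With b=0
(no length normalization) and a very large k1 (no saturation), the BM25
weight approaches plain tf * idf using the same idf.

```python
import numpy as np


def bm25_weights(counts, k1=1.2, b=0.75):
    """Hypothetical helper: BM25 weights for a (n_docs, n_terms) count matrix.

    Returns the weight matrix and the idf vector that was used.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]

    # Document frequency and a non-negative idf (one common variant).
    df = (counts > 0).sum(axis=0)
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

    # Length normalization: b controls how strongly long documents are penalized.
    doc_len = counts.sum(axis=1, keepdims=True)
    length_norm = 1.0 - b + b * doc_len / doc_len.mean()

    # Saturating term-frequency factor: bounded above by k1 + 1.
    tf_part = counts * (k1 + 1.0) / (counts + k1 * length_norm)
    return tf_part * idf, idf


counts = np.array([[1, 0, 2],
                   [10, 1, 0],
                   [0, 3, 0]], dtype=float)

bm25, idf = bm25_weights(counts)

# Limiting case: no length normalization and (almost) no saturation.
bm25_limit, _ = bm25_weights(counts, k1=1e9, b=0.0)
plain_tfidf = counts * idf
```

In this sketch `bm25_limit` and `plain_tfidf` agree to within floating-point
tolerance, while the default `bm25` caps the contribution of the term that
occurs 10 times in the second document.
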
>
> So is the advantage that BM25 provides for large, diverse corpora worth it,
> or is it marginal? Perhaps you can point me to some more examples where
> TFIDF is used (in a supervised setting, preferably) so I can plug in BM25
> in place of TFIDF and see how it compares. Here are some I found:
>
>
> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
> *(supervised)*
>
> http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
> *(unsupervised)*
>
> Thank you!
> Basil
>
> PS: By the way, I'm not familiar with the delta-idf transform that Pavel
> mentions in the archive you linked; I'll have to delve deeper into that. I
> agree with the response to Pavel that he should put it in a separate
> class rather than adding it to the TFIDF one. I think it would take me
> about 6-8 weeks to adapt my code to the fit/transform model and submit a
> pull request.
>
-- 
Pavel SORIANO

PhD Student
ERIC Laboratory
Université de Lyon