Re: [scikit-learn] Adding BM25 relevance function
Hi Joel,

Thanks for your response and for digging up that archived thread; it gives me a lot of clarity.

I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well, although it could be "overkill".

Consider that TFIDF does not produce normalized results either <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e...>. If BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary considerably across a broad range of values. Maybe you could even say TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k).

So is the advantage that BM25 provides for large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (preferably in a supervised setting), so that I can plug in BM25 in place of TFIDF and see how it compares. Here are some I found:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_dat... *(supervised)*
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e... *(unsupervised)*

Thank you!
Basil

PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked; I'll have to delve deeper into that. I agree with the response to Pavel that he should put it in a separate class rather than adding it on to the TFIDF. I think it would take me about 6-8 weeks to adapt my code to the fit/transform model and submit a pull request.
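For reference, the "same equation plus two hyperparameters" remark can be made concrete with a minimal sketch of BM25 weighting written in a fit/transform shape. This is not scikit-learn code and not the code mentioned in the message: the class name `BM25Weighter`, the default values `k1=1.2` and `b=0.75`, the particular idf variant, and the toy count matrix are all assumptions chosen for illustration. It uses plain Python lists instead of sparse matrices to stay self-contained.

```python
import math


class BM25Weighter:
    """Hypothetical fit/transform-style BM25 weighting over raw term counts.

    Not a scikit-learn class; the name and parameters are illustrative only.
    """

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # controls how quickly term frequency saturates
        self.b = b    # strength of document-length normalization (0 = none)

    def fit(self, counts):
        # counts: list of documents, each a list of term counts
        # (all documents share the same vocabulary order)
        n_docs = len(counts)
        n_terms = len(counts[0])
        # document frequency: number of documents containing each term
        df = [sum(1 for doc in counts if doc[t] > 0) for t in range(n_terms)]
        # one common non-negative idf variant (an assumption; others exist)
        self.idf_ = [math.log(1.0 + (n_docs - d + 0.5) / (d + 0.5)) for d in df]
        # average document length, measured in term occurrences
        self.avgdl_ = sum(sum(doc) for doc in counts) / n_docs
        return self

    def transform(self, counts):
        weighted = []
        for doc in counts:
            # length-dependent part of the denominator for this document
            norm = self.k1 * (1.0 - self.b + self.b * sum(doc) / self.avgdl_)
            weighted.append([
                idf * tf * (self.k1 + 1.0) / (tf + norm) if tf else 0.0
                for tf, idf in zip(doc, self.idf_)
            ])
        return weighted

    def fit_transform(self, counts):
        return self.fit(counts).transform(counts)


# Toy term-count matrix: 3 documents, 3 terms (made-up numbers).
counts = [
    [1, 0, 2],
    [10, 1, 0],
    [0, 3, 1],
]
weights = BM25Weighter().fit_transform(counts)
```

With `b=0` and a very large `k1`, the term-frequency factor `tf * (k1 + 1) / (tf + k1)` approaches `tf`, so the weight approaches plain `tf * idf`. In that limited sense the two schemes coincide, and `k1` and `b` only add saturation and length normalization on top.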
Hey,

Good thing that you are trying to finish this.

Well, I looked into my old notes, and the Delta tf-idf comes from the paper "Delta TFIDF: An Improved Feature Space for Sentiment Analysis" <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf>. I guess it is not very popular, and apparently it has a drawback: it does not take into account the number of times a word occurs in each document when calculating the distribution amongst classes. At least that is what I wrote in my notes...

As for the delta idf... If it helps, I can look into my old code, because I do not know what I was talking about. I guess it has to do somehow with the paper cited above.

Cheers,
Pavel Soriano

On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <basilbeirouti@gmail.com> wrote:
-- Pavel SORIANO PhD Student ERIC Laboratory Université de Lyon
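To illustrate the drawback described in the reply, here is a rough sketch of a Delta TFIDF style weighting: a term's count in the document, multiplied by the difference between its idf in the positive and in the negative training set. The exact form and sign convention are my reading of the cited paper and should be treated as an assumption. The function name `delta_tfidf`, the additive `smooth` term, and the toy data are hypothetical and do not come from the thread.

```python
import math


def delta_tfidf(doc_counts, pos_docs, neg_docs, smooth=1.0):
    """Weight one document's term counts by the difference of class-wise idf.

    doc_counts: term counts of the document being weighted
    pos_docs / neg_docs: labeled training documents as lists of term counts
    smooth: additive smoothing to avoid log(0); an assumption, not from the paper
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = []
    for t, tf in enumerate(doc_counts):
        # Per-class document frequencies: only the presence of the term in a
        # training document matters here, not how often it occurs there.
        p_t = sum(1 for d in pos_docs if d[t] > 0)
        n_t = sum(1 for d in neg_docs if d[t] > 0)
        idf_pos = math.log2((n_pos + smooth) / (p_t + smooth))
        idf_neg = math.log2((n_neg + smooth) / (n_t + smooth))
        weights.append(tf * (idf_pos - idf_neg))
    return weights


# Made-up training data: 2 terms, 2 documents per class.
pos = [[1, 0], [5, 1]]
neg = [[0, 2], [1, 1]]
doc = [3, 3]
w = delta_tfidf(doc, pos, neg)

# Same presence pattern as `pos`, but very different within-document counts.
pos_rescaled = [[9, 0], [1, 7]]
w_rescaled = delta_tfidf(doc, pos_rescaled, neg)
```

In this sketch `w` and `w_rescaled` are identical. Changing how often a word occurs inside the training documents has no effect as long as the set of documents containing it stays the same, which is the limitation noted above.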