<div dir="ltr"><div class="gmail_extra">Hi Joel,</div><div class="gmail_extra"><br></div><div class="gmail_extra">Thanks for your response and for digging up that archived thread; it gives me a lot of clarity.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well, though it could be "overkill".</div><div class="gmail_extra"><br></div><div class="gmail_extra">Consider that TFIDF does not produce normalized results <a href="http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py">either</a>. If BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary considerably across a broad range of values. You could even say that TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k).</div><div class="gmail_extra"><br></div><div class="gmail_extra">So is the advantage that BM25 provides for large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (preferably in a supervised setting), so that I can plug in BM25 in place of TFIDF and see how it compares. 
Here are some I found:</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra"><a href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html">http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</a> <b>(supervised)</b><br></div><div class="gmail_extra"><a href="http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py">http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py</a> <b>(unsupervised)</b></div><div class="gmail_extra"><br></div><div class="gmail_extra">Thank you!<br>Basil</div></div><div class="gmail_extra"><br></div><div class="gmail_extra">PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked, so I'll have to delve deeper into that. I agree with the response to Pavel that he should put it in a separate class rather than adding it on to the TFIDF one. I think it would take me about 6-8 weeks to adapt my code to the fit/transform model and submit a pull request.</div></div>
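<div class="gmail_extra">PPS: To make the "same equation" remark concrete, here is a minimal sketch of the two per-term weights side by side. The function names are hypothetical, the TFIDF variant uses the smoothed IDF ln((1 + n) / (1 + df)) + 1, and the BM25 variant uses one common form of the IDF; I am writing the "k" hyperparameter as k1, with the usual defaults k1 = 1.2 and b = 0.75 assumed. It is only an illustration of the formulas, not the code I would submit.</div>

```python
import math


def tfidf_weight(tf, df, n_docs):
    """TF-IDF weight for one term in one document (illustrative helper).

    tf: raw count of the term in the document
    df: number of documents containing the term
    n_docs: number of documents in the corpus
    """
    idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
    return tf * idf  # grows linearly with tf


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document (illustrative helper).

    k1 controls how quickly the term frequency saturates;
    b controls how strongly the weight is normalized by document length.
    """
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))  # one common IDF variant
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)  # bounded by idf * (k1 + 1)


if __name__ == "__main__":
    # Same term statistics, increasing term frequency.
    for tf in (1, 5, 50):
        w_tfidf = tfidf_weight(tf, df=5, n_docs=100)
        w_bm25 = bm25_weight(tf, df=5, n_docs=100, doc_len=100, avg_doc_len=100)
        print(f"tf={tf:3d}  tfidf={w_tfidf:8.3f}  bm25={w_bm25:6.3f}")
```

<div class="gmail_extra">The TFIDF weight keeps growing with the term frequency, while the BM25 weight levels off and also depends on the document length, which is where I would expect the two to differ on a large, diverse corpus.</div>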