Why the modification in the tf-idf formula?
Hi guys,

I'd like to understand why sklearn's implementation of tf-idf is different from the standard textbook notation as described in the docs: https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-w...

Do you have any reference that I could take a look at? I didn't manage to find them in the docs, maybe I missed something?

Thank you!

Best wishes
Sole
Hi Sole,

It’s been a long time, but I remember helping draft the tf-idf text in the documentation as part of a scikit-learn sprint at SciPy a looong time ago, where I mentioned this difference (it initially surprised me because I couldn’t get it to match my from-scratch implementation). As far as I remember, the sklearn version addressed some instability issues for certain edge cases.

I am not sure if that helps, but I have briefly compared the textbook vs. the sklearn tf-idf here: https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb

Best,
Sebastian

--
Sebastian Raschka, PhD
Machine learning and AI researcher, https://sebastianraschka.com
Staff Research Engineer at Lightning AI, https://lightning.ai

On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn <scikit-learn@python.org> wrote:
> Do you have any reference that I could take a look at?

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
Hi Sebastian,

Thank you so much for sending the link.

So, by the looks of it, the modification is introduced so that words that appear in all documents are weighted at 0 (or at 1, once the plus 1 is added to the result of the log). Otherwise, they'd receive a negative value.

Thank you!

Best
Sole

On Tuesday, May 28th, 2024 at 4:52 PM, Sebastian Raschka <mail@sebastianraschka.com> wrote:
> I am not sure if that helps, but I have briefly compared the textbook vs the sklearn tf-idf here: https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb
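The point about terms that appear in every document can be checked with a few lines of plain Python. This is only a sketch and is not taken from the thread or the linked notebook. It assumes the three formulas given in the scikit-learn documentation: the textbook idf log(n / (1 + df)), and the TfidfTransformer variants log(n / df) + 1 (smooth_idf=False) and log((1 + n) / (1 + df)) + 1 (smooth_idf=True, the default). The function names idf_textbook and idf_sklearn and the toy corpus size are made up for illustration.

```python
import math


def idf_textbook(n_docs, df):
    # Textbook variant as quoted in the scikit-learn docs: log(n / (1 + df)).
    return math.log(n_docs / (1 + df))


def idf_sklearn(n_docs, df, smooth_idf=True):
    # Formulas documented for TfidfTransformer (natural log in both cases).
    if smooth_idf:
        # Acts as if one extra document containing every term had been seen.
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


n_docs = 3  # hypothetical corpus of three documents
for df in (1, 2, 3):
    print(
        f"df={df}",
        f"textbook={idf_textbook(n_docs, df):.4f}",
        f"sklearn_smooth={idf_sklearn(n_docs, df):.4f}",
        f"sklearn_no_smooth={idf_sklearn(n_docs, df, smooth_idf=False):.4f}",
    )
```

For df equal to n_docs, the textbook value is log(3/4), which is negative, while both sklearn variants give exactly 1, so a term that occurs in every document keeps a positive weight instead of being zeroed out or pushed below zero.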