The exact formula used to compute the tf-idf
Hi,

I am trying to understand the exact formula for tf-idf.

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
    wordtfidf = vectorizer.fit_transform(texts)

Given the following 3 documents (id1, id2, id3 are the IDs of the three documents):

    id1: AA BB BB CC CC CC
    id2: AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
    id3: AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF

The results are the following (document, term, tf-idf):

    id1  cc  5.079441541679836
    id1  bb  2.5753641449035616
    id1  aa  1.0
    id2  dd  7.726092434710685
    id2  bb  6.438410362258904
    id2  aa  4.0
    id3  ff  15.238324625039509
    id3  dd  10.301456579614246
    id3  aa  7.0

According to "6.2.3.4. Tf–idf term weighting" on the following page:

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature...

For aa, since n = 3 and df = 3, idf(aa) = log((1 + n)/(1 + df)) + 1 = 1.

But I don't understand why tf-idf(id1, aa) is 1. That would mean tf(id1, aa) is 1, which is just the raw count of aa. Shouldn't it be divided by the number of terms in document id1, which would give 1/6 instead of 1?

Thanks.

--
Regards,
Peng
Hi there,

Unfortunately I don't currently have time to walk through your example, but I wrote down how the tf-idf in sklearn works, with some examples, here:

https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221...

(I remember that we later used it to write portions of the scikit-learn documentation.)

Best,
Sebastian
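To the specific question above: with scikit-learn's defaults (smooth_idf=True, sublinear_tf=False) and norm=None, tf is simply the raw term count; it is not divided by the document length. A minimal sketch (not part of the write-up linked above) that reproduces the id1 values by hand under those assumptions:

```python
import math

# Statistics from the three example documents:
# n = total number of documents, df = document frequency per term,
# counts_id1 = raw term counts in document id1.
n = 3
df = {"aa": 3, "bb": 2, "cc": 1}
counts_id1 = {"aa": 1, "bb": 2, "cc": 3}

# scikit-learn's smoothed idf (the default, smooth_idf=True):
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
def idf(term):
    return math.log((1 + n) / (1 + df[term])) + 1

# tf is the raw count; with norm=None no normalization is applied
# afterwards, so the count is NOT divided by the document length.
tfidf_id1 = {t: c * idf(t) for t, c in counts_id1.items()}
print(tfidf_id1)  # aa -> 1.0, bb -> ~2.5754, cc -> ~5.0794
```

This matches the output in the question: idf(aa) = ln(4/4) + 1 = 1 and the raw count of aa in id1 is 1, so tf-idf(id1, aa) = 1, not 1/6.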
On Feb 1, 2020, at 12:53 PM, Peng Yu <pengyu.ut@gmail.com> wrote:
> Hi,
> I am trying to understand the exact formula for tf-idf.
> [...]

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
participants (2)
- Peng Yu
- Sebastian Raschka