[scikit-learn] CountVectorizer: Additional Feature Suggestion
Roman Yurchak
rth.yurchak at gmail.com
Tue Jan 30 14:33:42 EST 2018
Hi Yacine,
On 29/01/18 16:39, Yacine MAZARI wrote:
> >> I wouldn't hate if length normalisation was added to TfidfTransformer
> >> if it was shown that normalising before IDF multiplication was more
> >> effective than (or complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can for example refer to:
>
> * NLTK
> <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
> which uses document-length-normalized term frequencies.
>
> * Manning, Raghavan, and Schütze's Introduction to Information Retrieval
> <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
> "The same considerations that led us to prefer weighted
> representations, in particular length-normalized tf-idf
> representations, in Chapters 6 and 7 also apply here."
I believe the conclusion of Manning's Chapter 6 is the following table
of TF-IDF weighting schemes:
https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
in which the document length normalization is applied _after_ the IDF.
So "length-normalized tf-idf" is just TfidfVectorizer with norm='l1', as
previously mentioned (at least if you measure the length of a document
as the number of words it contains).
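
To make the ordering concrete, here is a minimal sketch (the docs
corpus below is made up for illustration) showing that scikit-learn
normalizes _after_ multiplying by the IDF, and that dividing the counts
by document length _before_ the IDF multiplication changes each
document vector only by a per-document scaling factor:

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

# (a) scikit-learn's behaviour: multiply tf by idf, then l1-normalize rows.
tfidf_after = TfidfVectorizer(norm='l1').fit_transform(docs)

# (b) the ordering discussed above: divide the counts by document length
# (i.e. l1-normalize the raw counts) first, then multiply by idf.
counts = CountVectorizer().fit_transform(docs)
tf = normalize(counts, norm='l1')                 # tf(t, d) / |d|
idf = TfidfTransformer(norm=None).fit(counts).idf_
tfidf_before = tf.multiply(idf).tocsr()           # (tf / |d|) * idf

# Each row of (b) is a positive multiple of the corresponding row of (a),
# so e.g. cosine similarities between documents are identical either way.
print(np.round(tfidf_after.toarray(), 3))
print(np.round(tfidf_before.toarray(), 3))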
More generally, a weighting & normalization transformer for some of the
other configurations in that table is implemented in
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
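
For instance, the "ltc" entry of that table (logarithmic tf, idf
weighting, cosine normalization) can be approximated directly in
scikit-learn; a sketch, keeping in mind that scikit-learn's smoothed
IDF formula differs slightly from the SMART definition:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

# sublinear_tf=True replaces tf with 1 + ln(tf), and norm='l2' applies
# cosine normalization; the idf factor is scikit-learn's smoothed
# variant ln((1 + n) / (1 + df)) + 1 rather than SMART's ln(N / df).
ltc = TfidfVectorizer(sublinear_tf=True, norm='l2').fit_transform(docs)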
With respect to the NLTK implementation, see
https://github.com/nltk/nltk/pull/979#issuecomment-102296527
So I don't think there is a need to change anything in TfidfTransformer...
--
Roman