[scikit-learn] CountVectorizer: Additional Feature Suggestion
Roman Yurchak
rth.yurchak at gmail.com
Tue Jan 30 14:33:42 EST 2018
Hi Yacine,
On 29/01/18 16:39, Yacine MAZARI wrote:
> >> I wouldn't hate if length normalisation was added to TfidfTransformer
> >> if it was shown that normalising before IDF multiplication was more
> >> effective than (or complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can for example refer to:
>
> * NLTK
> <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
> which uses document-length-normalized term frequencies.
>
> * Manning, Raghavan, and Schütze's Introduction to Information Retrieval
> <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
> "The same considerations that led us to prefer weighted
> representations, in particular length-normalized tf-idf
> representations, in Chapters 6 and 7 also apply here."
I believe the conclusion of Manning's Chapter 6 is the following table
of TF-IDF weighting schemes:
https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
in which the document length normalization is applied _after_ the IDF.
So "length-normalized tf-idf" is just TfidfVectorizer with norm='l1', as
previously mentioned (at least if you measure the length of a document
as the number of words it contains).
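
To make the ordering concrete, here is a minimal sketch (the docs
corpus below is made up for illustration) showing that scikit-learn
normalizes _after_ multiplying by the IDF, and that dividing the counts
by document length _before_ the IDF multiplication changes each
document vector only by a per-document scaling factor:

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

# (a) scikit-learn's behaviour: multiply tf by idf, then l1-normalize rows.
tfidf_after = TfidfVectorizer(norm='l1').fit_transform(docs)

# (b) the ordering discussed above: divide the counts by document length
# (i.e. l1-normalize the raw counts) first, then multiply by idf.
counts = CountVectorizer().fit_transform(docs)
tf = normalize(counts, norm='l1')                 # tf(t, d) / |d|
idf = TfidfTransformer(norm=None).fit(counts).idf_
tfidf_before = tf.multiply(idf).tocsr()           # (tf / |d|) * idf

# Each row of (b) is a positive multiple of the corresponding row of (a),
# so e.g. cosine similarities between documents are identical either way.
print(np.round(tfidf_after.toarray(), 3))
print(np.round(tfidf_before.toarray(), 3))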
More generally, a weighting & normalization transformer for some of the
other configurations in that table is implemented in
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
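
For instance, the "ltc" entry of that table (logarithmic tf, idf
weighting, cosine normalization) can be approximated directly in
scikit-learn; a sketch, keeping in mind that scikit-learn's smoothed
IDF formula differs slightly from the SMART definition:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

# sublinear_tf=True replaces tf with 1 + ln(tf), and norm='l2' applies
# cosine normalization; the idf factor is scikit-learn's smoothed
# variant ln((1 + n) / (1 + df)) + 1 rather than SMART's ln(N / df).
ltc = TfidfVectorizer(sublinear_tf=True, norm='l2').fit_transform(docs)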
With respect to the NLTK implementation, see
https://github.com/nltk/nltk/pull/979#issuecomment-102296527
So I don't think there is a need to change anything in TfidfTransformer...
--
Roman