[scikit-learn] TF-IDF

Roman Yurchak rth.yurchak at gmail.com
Mon Oct 2 03:28:37 EDT 2017


Hi Apurva,

If you consider the operations done by the augmented frequency and the 
cosine normalization independently of everything else, they are 
somewhat similar. Normalization by the max is a p-norm normalization 
with p→+∞. So, apart from the 0.5 offset, both can be seen as document 
length normalization with a different p value.
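
To make the p-norm view concrete, here is a minimal NumPy sketch. The term 
counts are made up for illustration, and the helper name p_normalize is 
hypothetical. The augmented frequency is assumed to be the usual 
0.5 + 0.5 * tf / max(tf), applied only to terms that occur in the document.

```python
import numpy as np

# Hypothetical term counts for a single document (illustrative values only).
tf = np.array([3.0, 1.0, 0.0, 2.0])


def p_normalize(x, p):
    """Divide x by its p-norm; p=np.inf divides by the largest absolute value."""
    return x / np.linalg.norm(x, ord=p)


# Normalization by max is the p -> +inf case.
max_normalized = p_normalize(tf, np.inf)

# Cosine (l2) normalization is the p = 2 case.
l2_normalized = p_normalize(tf, 2)

# Augmented frequency (assumed definition): 0.5 + 0.5 * tf / max(tf),
# kept at zero for terms that do not occur so that the vector stays sparse.
augmented = np.where(tf > 0, 0.5 + 0.5 * max_normalized, 0.0)

print(max_normalized)
print(l2_normalized)
print(augmented)
```

The only difference between the two normalized vectors is the value of p 
passed to the norm; the 0.5 offset is what sets augmented frequency apart.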

However, in TF-IDF you would typically have an IDF document 
weighting operation between the term frequency weighting and the 
normalization, in which case the effect of both will be quite different. 
Generally I find that the SMART IR notation is very useful to represent 
different phases of the TF-IDF transformation.

The default parameters of TfidfTransformer are a good choice that will 
work well in most cases. Also, depending on the algorithm that you use 
afterwards, not having your data normalized by an actual norm (e.g. 
cosine) may be sub-optimal. Still, if you want to fine-tune your 
document normalization, have a look at the "Pivoted Document Length 
Normalization" paper by Singhal et al. There is a compatible 
implementation of this and a few other TF-IDF schemes in 
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
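
As a rough illustration of the defaults mentioned above, the sketch below 
runs TfidfTransformer on a tiny made-up corpus and checks that each row ends 
up with unit l2 norm. It also shows one possible way to get max normalization 
after IDF weighting, by turning off the built-in norm and calling 
sklearn.preprocessing.normalize with norm='max'. This is only one option and 
is not the same thing as augmented frequency (no 0.5 offset, and the max is 
taken after IDF weighting).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Tiny made-up corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat around the garden and the house",
    "dogs and cats",
]

counts = CountVectorizer().fit_transform(docs)

# Defaults: norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False.
X_l2 = TfidfTransformer().fit_transform(counts)
l2_row_norms = np.linalg.norm(X_l2.toarray(), axis=1)

# Alternative: no built-in norm, then divide each row by its max value.
X_raw = TfidfTransformer(norm=None).fit_transform(counts)
X_max = normalize(X_raw, norm="max")
max_row_values = X_max.toarray().max(axis=1)

print(l2_row_norms)
print(max_row_values)
```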

In the end, it's probably easier to try different options on your 
dataset to see what works and what doesn't. You could just determine it 
by cross-validation.
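
A minimal sketch of what such a comparison could look like, using a grid 
search over the norm and sublinear_tf options of TfidfVectorizer. The toy 
corpus, the labels, and the choice of LogisticRegression are placeholders 
for your own data and model, so the scores here mean nothing in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus (placeholder for a real dataset).
docs = [
    "the cat sat on the mat", "cats and kittens purr", "my cat chases mice",
    "a kitten sleeps all day", "the cat likes warm milk", "kittens play with yarn",
    "stocks fell on the market", "the market rallied today", "investors buy stocks",
    "bond prices and the market", "stocks and bonds trade", "investors sold bonds today",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Candidate weighting and normalization options to compare.
param_grid = {
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__sublinear_tf": [False, True],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```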

-- 
Roman

On 27/09/17 13:53, Apurva Nandan wrote:
> Hello,
>
> Could anybody tell me the difference between using augmented frequency
> (which is used for weighting term frequencies to eliminate the bias
> towards larger documents) and cosine normalization (l2 norm which
> scikit-learn uses for TfidfTransformer).
> Augmented frequency is given by the following equation. It tries to
> divide the natural term frequency by the maximum frequency of any term
> in the document.
>
> [Inline image 1: augmented frequency equation, not preserved]
>
> Do they both do the same thing when it comes to eliminating bias towards
> larger documents? I suppose scikit-learn uses the natural term freq, and
> using cosine normalization is enabled with using norm=l2
>
> Any help would be appreciated!
>
> - Apurva
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


