[scikit-learn] TF-IDF
Roman Yurchak
rth.yurchak at gmail.com
Mon Oct 2 03:28:37 EDT 2017
Hi Apurva,
if you consider the operations done by the augmented frequency and the
cosine normalization independently from everything else, they are
somewhat similar. The normalization by max is a p-norm with p→+∞. So
apart from the 0.5 offset, both can be seen as document length
normalization with a different p value.
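
As a rough illustration of that point, here is a minimal pure-Python sketch. The term counts are made up, the function names are hypothetical, and the augmented frequency is assumed to be the usual 0.5 + 0.5 * tf / max(tf), applied only to terms that occur in the document. It normalizes a short document and a three-times-longer copy of it by the max (p→+∞) and by the Euclidean norm (p=2):

```python
import math


def max_normalize(counts):
    """Divide by the largest count, i.e. by the L-infinity norm."""
    peak = max(counts)
    return [c / peak for c in counts]


def augmented_tf(counts):
    """Max normalization plus the 0.5 offset (assumed form, see lead-in)."""
    peak = max(counts)
    return [0.5 + 0.5 * c / peak if c > 0 else 0.0 for c in counts]


def cosine_normalize(counts):
    """Divide by the Euclidean (L2) norm."""
    length = math.sqrt(sum(c * c for c in counts))
    return [c / length for c in counts]


# Hypothetical term counts: a short document, and the same document
# repeated three times (every count tripled).
short_doc = [3, 1, 0, 2]
long_doc = [3 * c for c in short_doc]

for name, fn in [
    ("max", max_normalize),
    ("augmented", augmented_tf),
    ("cosine", cosine_normalize),
]:
    print(name, fn(short_doc), fn(long_doc))
```

Each scheme should give the same vector (up to floating point rounding) for the short and the long document, which is the length normalization effect; only the norm used, and the offset, differ.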
However, in TF-IDF you would typically have an IDF document
weighting operation between the term frequency weighting and the
normalization, in which case the effect of both will be quite different.
Generally I find that the SMART IR notation is very useful to represent
different phases of the TF-IDF transformation.
The default parameters of TfidfTransformer are a good choice that will
work well in most cases. Also, depending on the algorithm that you use
afterwards, not having your data normalized by an actual norm (e.g.
cosine) may be sub-optimal. Still, if you want to fine-tune your
document normalization, have a look at the "Pivoted Document Length
Normalization" paper by Singhal et al. There is a compatible
implementation of this and a few other TF-IDF schemes in
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
In the end, it's probably easier to try different options on your
dataset to see what works and what doesn't. You could just determine it
by cross-validating.
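
A minimal sketch of what such a cross-validation could look like with scikit-learn is given below. The toy corpus, the labels, the choice of LogisticRegression as the downstream model and the parameter grid are all assumptions for illustration only; substitute your own data and estimator.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with your own documents and labels.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "cats chase mice at night",
    "my cat sleeps all day",
    "the dog barked at the mailman",
    "a dog and a puppy play",
    "dogs chase cars on the road",
    "my dog sleeps all night",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),  # assumed downstream model
])

# TF-IDF options to compare; norm=None disables the final normalization.
param_grid = {
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__sublinear_tf": [False, True],
    "tfidf__use_idf": [False, True],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

On a corpus this small the selected parameters are not meaningful; the point is only the mechanics of comparing weighting and normalization options on your own dataset.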
--
Roman
On 27/09/17 13:53, Apurva Nandan wrote:
> Hello,
>
> Could anybody tell me the difference between using augmented frequency
> (which is used for weighting term frequencies to eliminate the bias
> towards larger documents) and cosine normalization (l2 norm which
> scikit-learn uses for TfidfTransformer).
> Augmented frequency is given by the following equation. It tries to
> divide the natural term frequency by the maximum frequency of any term
> in the document.
>
> Inline image 1
>
> Do they both do the same thing when it comes to eliminating bias towards
> larger documents? I suppose scikit-learn uses the natural term freq, and
> using cosine normalization is enabled with using norm=l2
>
> Any help would be appreciated!
>
> - Apurva