[scikit-learn] TF-IDF

Wed Sep 27 07:53:08 EDT 2017

Hello,

Could anybody tell me the difference between using augmented frequency
(which weights term frequencies to reduce the bias towards longer
documents) and cosine normalization (the l2 norm, which scikit-learn uses
in TfidfTransformer)?
Augmented frequency is given by the following equation. It divides the
natural (raw) term frequency by the maximum frequency of any term in the
document.

[image: Inline image 1]
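A commonly cited form of augmented frequency is sketched below, offered as an assumption because the image is not reproduced in this text. Here f_{t,d} is the raw count of term t in document d, and the smoothing constant K = 0.5 is the usual textbook choice, not something confirmed by this message.

```latex
\mathrm{tf}(t, d) = K + (1 - K)\,\frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}, \qquad K = 0.5
```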

Do they both do the same thing when it comes to reducing the bias towards
longer documents? As I understand it, scikit-learn uses the natural term
frequency, and cosine normalization is enabled by setting norm='l2'.
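
To make the comparison concrete, here is a minimal pure-Python sketch of the two schemes applied to a toy document and to a copy that is ten times longer. The helper names, the toy counts, and the constant k = 0.5 are all assumptions for illustration; this is not scikit-learn's implementation.

```python
import math

def augmented_tf(counts, k=0.5):
    """Augmented frequency: k + (1 - k) * f / max_f for each term in the document."""
    max_f = max(counts.values())
    return {t: k + (1 - k) * f / max_f for t, f in counts.items()}

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Hypothetical raw term counts for a short document.
short_doc = {"apple": 3, "banana": 1, "cherry": 1}
# The same document repeated ten times: same proportions, larger counts.
long_doc = {t: 10 * f for t, f in short_doc.items()}

aug_short = augmented_tf(short_doc)
aug_long = augmented_tf(long_doc)
l2_short = l2_normalize(short_doc)
l2_long = l2_normalize(long_doc)

for term in short_doc:
    print(term, aug_short[term], aug_long[term], l2_short[term], l2_long[term])
```

In this sketch both schemes give the long copy the same weights as the short document, so both remove the effect of pure length. They are not the same transformation, though: augmented frequency rescales each count relative to the most frequent term only, so the top term always gets weight 1, while l2 normalization rescales the whole vector to unit length. In TfidfTransformer the norm is applied to the rows after the idf weighting.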

Any help would be appreciated!

- Apurva