[scikit-learn] CountVectorizer: Additional Feature Suggestion

Yacine MAZARI y.mazari at gmail.com
Sun Jan 28 00:59:12 EST 2018


Hello,

I would like to work on adding an additional feature to
"sklearn.feature_extraction.text.CountVectorizer".

In the current implementation, the definition of term frequency is the
number of times a term t occurs in document d.

However, another definition that is commonly used in practice is the term
frequency adjusted for document length
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.
tf = raw count of the term in the document / document length.

I intend to implement this by adding a boolean parameter
"relative_frequency" to the constructor of CountVectorizer.
If the parameter is True, X is normalized by document length (row-wise,
i.e. along axis=1) in "CountVectorizer.fit_transform()".
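
For illustration only, here is a minimal sketch of the behaviour described
above, built from existing scikit-learn calls. It is not the proposed patch:
"relative_frequency" does not exist in CountVectorizer, so the normalization
is applied as a separate step. It also assumes that "document length" means
the number of tokens the vectorizer actually counted (after tokenization and
any vocabulary filtering), which may differ from the raw word count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Toy corpus; the documents are made up for this example.
docs = [
    "apple banana banana",
    "cherry apple apple apple",
]

# Raw term counts: one row per document, one column per vocabulary term.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# L1-normalizing each row divides every count by the row sum. Counts are
# non-negative, so the row sum is the number of counted tokens in the
# document (the assumed "document length").
relative = normalize(counts, norm="l1", axis=1)

print(sorted(vectorizer.vocabulary_))
print(relative.toarray())
```

Each non-empty row of the result sums to 1, and the matrix stays sparse, so
the extra step should be cheap compared to the counting itself.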

What do you think?
If this sounds reasonable and worthwhile, I will send a PR.

Thank you,
Yacine.