[scikit-learn] CountVectorizer: Additional Feature Suggestion
Yacine MAZARI
y.mazari at gmail.com
Sun Jan 28 00:59:12 EST 2018
Hello,
I would like to work on adding an additional feature to
"sklearn.feature_extraction.text.CountVectorizer".
In the current implementation, the definition of term frequency is the
number of times a term t occurs in document d.
However, another definition that is commonly used in practice is the term
frequency adjusted for document length
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.,
tf = raw count / document length.
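To make the definition concrete, here is a minimal pure-Python sketch. The
helper name "relative_term_frequency" is hypothetical, and it assumes that
"document length" means the total number of tokens in the document.

```python
from collections import Counter


def relative_term_frequency(tokens):
    """Return {term: raw count / document length} for one tokenized document.

    Hypothetical helper for illustration only; "document length" is taken
    to be the total number of tokens.
    """
    total = len(tokens)
    if total == 0:
        # An empty document has no terms, so there is nothing to normalize.
        return {}
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}


doc = "the cat sat on the mat".split()
tf = relative_term_frequency(doc)
```

With this definition, the values for a non-empty document sum to 1.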
I intend to implement this by adding an additional boolean parameter
"relative_frequency" to the constructor of CountVectorizer.
If the parameter is true, X is normalized by document length (along axis=1)
in "CountVectorizer.fit_transform()".
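As a rough sketch of the intended result using only the existing API (the
"relative_frequency" parameter itself is only proposed and does not exist):
since counts are non-negative, L1-normalizing each row is the same as dividing
by the row sum. This assumes "document length" means the number of tokens that
CountVectorizer actually counts, i.e. after tokenization and any vocabulary
filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog barked"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix of raw term counts

# Divide each row by its sum. For non-negative counts this equals
# raw count / number of counted tokens in the document.
relative = normalize(counts, norm="l1", axis=1)
```

Each non-empty row of "relative" then sums to 1, and the result stays sparse.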
What do you think?
If this sounds reasonable and worthwhile, I will send a PR.
Thank you,
Yacine.