[scikit-learn] CountVectorizer: Additional Feature Suggestion

Jacob Vanderplas jakevdp at cs.washington.edu
Sun Jan 28 01:11:08 EST 2018


Hi Yacine,
If I'm understanding you correctly, I think what you have in mind is
already implemented in scikit-learn in the TF-IDF vectorizer
<http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>.
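
For instance, here is a minimal sketch of how length-normalized term
frequencies could be obtained, assuming that "document length" means the
number of tokens the vectorizer keeps after tokenization. The toy documents
and variable names are made up for illustration. Turning off IDF weighting
(use_idf=False) and using L1 normalization (norm="l1") should make each row
equal to the raw counts divided by their row sum:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, purely illustrative.
docs = ["the cat sat", "the cat sat on the mat"]

# use_idf=False disables IDF weighting; norm="l1" divides each row by the
# sum of its entries, i.e. by the number of counted tokens in the document.
tf_vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
tf = tf_vectorizer.fit_transform(docs).toarray()

# Reference computation: raw counts divided by document length.
counts = CountVectorizer().fit_transform(docs).toarray()
expected = counts / counts.sum(axis=1, keepdims=True)

# Both approaches should agree, and every row should sum to 1.
assert np.allclose(tf, expected)
assert np.allclose(tf.sum(axis=1), 1.0)
```

Note that tokens dropped by the tokenizer or by stop-word filtering are not
counted toward the length in this sketch.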

Best,
   Jake

 Jake VanderPlas
 Senior Data Science Fellow
 Director of Open Software
 University of Washington eScience Institute

On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com> wrote:

> Hello,
>
> I would like to work on adding an additional feature to
> "sklearn.feature_extraction.text.CountVectorizer".
>
> In the current implementation, the definition of term frequency is the
> number of times a term t occurs in document d.
>
> However, another definition that is commonly used in practice is the term
> frequency adjusted for document length
> <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.,
> tf = raw count / document length.
>
> I intend to implement this by adding a boolean parameter
> "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is True, X is normalized by document length (along
> axis=1) in "CountVectorizer.fit_transform()".
>
> What do you think?
> If this sounds reasonable and worth it, I will send a PR.
>
> Thank you,
> Yacine.