[scikit-learn] CountVectorizer: Additional Feature Suggestion

Joel Nothman joel.nothman at gmail.com
Sun Jan 28 04:29:58 EST 2018


sklearn.preprocessing.Normalizer allows you to normalize any vector by its
L1 or L2 norm. L1 normalisation is equivalent to dividing by "document
length", as long as you do not intend to count stop words (which the
vectorizer may discard) in that length.
sklearn.feature_extraction.text.TfidfTransformer offers similar norming,
but does so only after applying the IDF or TF transformation. Since length
normalisation is a stateless transformation, it can also be computed with a
sklearn.preprocessing.FunctionTransformer.
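
For concreteness, here is a sketch of both routes (untested; "docs" stands
for any list of raw documents):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, Normalizer, normalize

    docs = ["the cat sat", "the cat sat on the mat"]  # placeholder corpus

    # Normalizer route: norm='l1' divides each row of the count matrix by
    # its sum, i.e. by the document length as counted over extracted tokens.
    pipe = make_pipeline(CountVectorizer(), Normalizer(norm="l1"))
    X_rel = pipe.fit_transform(docs)

    # FunctionTransformer route: the same stateless transformation.
    length_norm = FunctionTransformer(
        lambda X: normalize(X, norm="l1"), accept_sparse=True
    )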

I can't say it's especially obvious that these features are available, and
improvements to the documentation are welcome, but CountVectorizer is
complicated enough and we would rather avoid more parameters if we can. I
wouldn't hate it if length normalisation were added to TfidfTransformer, if
it were shown that normalising before IDF multiplication is more effective
than (or complementary to) norming afterwards.

On 28 January 2018 at 18:31, Yacine MAZARI <y.mazari at gmail.com> wrote:

> Hi Jake,
>
> Thanks for the quick reply.
>
> What I meant is different from TfidfVectorizer. Let me clarify:
>
> In TfidfVectorizer, the raw counts are multiplied by IDF, which basically
> means normalizing the counts by document frequencies: tf * idf. But tf is
> still defined here as the raw count of a term in the document.
>
> What I am suggesting is to add the possibility of using another
> definition of tf: tf = relative frequency of a term in a document = raw
> counts / document length.
> On top of this, one could further normalize by IDF to get the TF-IDF (
> https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
>
> When can this be useful? Here is an example:
> Say term t occurs 5 times in document d1, and also 5 times in document d2.
> At first glance, it seems that the term conveys the same information about
> both documents. But if we also check document lengths, and find that the
> length of d1 is 20 whereas the length of d2 is 200, then the “importance”
> and information carried by the same term in the two documents is probably
> not the same.
> If we use relative frequency instead of absolute counts, then tf1=5/20=0.25
> whereas tf2=5/200=0.025.
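>
> A quick check of that arithmetic in plain Python:
>
>     tf_d1 = 5 / 20     # 0.25
>     tf_d2 = 5 / 200    # 0.025
>     # same raw count, tenfold difference in relative frequency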
>
> There are many practical cases (document similarity, document
> classification, etc...) where using relative frequencies yields better
> results, and it might be worth making the CountVectorizer support this.
>
> Regards,
> Yacine.
>
> On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jakevdp at cs.washington.edu>
> wrote:
>
>> Hi Yacine,
>> If I'm understanding you correctly, I think what you have in mind is
>> already implemented in scikit-learn in the TF-IDF vectorizer
>> <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>.
>>
>> Best,
>>    Jake
>>
>>  Jake VanderPlas
>>  Senior Data Science Fellow
>>  Director of Open Software
>>  University of Washington eScience Institute
>>
>> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I would like to work on adding an additional feature to
>>> "sklearn.feature_extraction.text.CountVectorizer".
>>>
>>> In the current implementation, the definition of term frequency is the
>>> number of times a term t occurs in document d.
>>>
>>> However, another definition that is very commonly used in practice is
>>> the term frequency adjusted for document length
>>> <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e:
>>> tf = raw counts / document length.
>>>
>>> I intend to implement this by adding an additional boolean parameter
>>> "relative_frequency" to the constructor of CountVectorizer.
>>> If the parameter is true, normalize X by document length (along axis=1)
>>> in "CountVectorizer.fit_transform()".
>>>
>>> What do you think?
>>> If this sounds reasonable and worth it, I will send a PR.
>>>
>>> Thank you,
>>> Yacine.
>>>