CountVectorizer: Additional Feature Suggestion
Hello,

I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer". In the current implementation, term frequency is defined as the number of times a term t occurs in document d. However, another definition that is very commonly used in practice is the term frequency adjusted for document length <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e. tf = raw count / document length.

I intend to implement this by adding a boolean parameter "relative_frequency" to the constructor of CountVectorizer. If the parameter is true, X is normalized by document length (along axis=1) in "CountVectorizer.fit_transform()".

What do you think? If this sounds reasonable and worth it, I will send a PR.

Thank you, Yacine.
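[Editor's note] The proposed behaviour can be sketched against the current API. Note that "relative_frequency" is the *proposed* parameter, not an existing scikit-learn option; the snippet below only illustrates the intended effect by dividing each row of the count matrix by the document's token total.

```python
# Illustrative sketch only: divide each row of the raw count matrix
# by that document's total token count (tf = count / document length).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog barked"]
X = CountVectorizer().fit_transform(docs)              # raw counts (sparse)
doc_lengths = np.asarray(X.sum(axis=1)).ravel()        # tokens per document
X_rel = X.multiply(1.0 / doc_lengths[:, None]).tocsr() # relative frequencies
```

Each row of X_rel now sums to 1, so documents of different lengths become directly comparable.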
Hi Yacine, if I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction....>.

Best, Jake

Jake VanderPlas, Senior Data Science Fellow, Director of Open Software, University of Washington eScience Institute
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Hi Jake,

Thanks for the quick reply. What I meant is different from the TfIdfVectorizer. Let me clarify: in the TfIdfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies, tf * idf. But tf is still defined here as the raw count of a term in the document.

What I am suggesting is to add the possibility to use another definition of tf: tf = relative frequency of a term in a document = raw count / document length. On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).

When can this be useful? Here is an example: say term t occurs 5 times in document d1, and also 5 times in document d2. At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that the length of d1 is 20 whereas the length of d2 is 200, then the "importance" and information carried by the same term in the two documents is probably not the same. If we use relative frequency instead of absolute counts, then tf1 = 5/20 = 0.25 whereas tf2 = 5/200 = 0.025.

There are many practical cases (document similarity, document classification, etc.) where using relative frequencies yields better results, and it might be worth making CountVectorizer support this.

Regards, Yacine.
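[Editor's note] The arithmetic of the example above, spelled out (same numbers, purely illustrative):

```python
# Same raw count, very different document lengths.
count = 5
len_d1, len_d2 = 20, 200

tf1 = count / len_d1   # relative frequency in d1
tf2 = count / len_d2   # relative frequency in d2

# Relative frequency says the term matters 10x more in d1 than in d2,
# while raw counts alone would treat both documents identically.
```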
sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after accounting for IDF or TF transformation. Since the length normalisation transformation is stateless, it can also be computed with a sklearn.preprocessing.FunctionTransformer.

I can't say it's especially obvious that these features are available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can. I wouldn't hate it if length normalisation were added to TfidfTransformer, if it were shown that normalising before IDF multiplication is more effective than (or complementary to) norming afterwards.
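[Editor's note] The two stateless alternatives mentioned above can be sketched with existing scikit-learn pieces; the helper name `length_normalize` below is ours, for illustration only.

```python
# Two equivalent ways to length-normalize counts without any new parameter.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

docs = ["the sun is shining", "the weather is sweet is it not"]

# Option 1: L1-normalize the counts (divide by in-vocabulary doc length).
p1 = make_pipeline(CountVectorizer(), Normalizer(norm="l1"))

# Option 2: the same stateless transform, via FunctionTransformer.
def length_normalize(X):
    return X.multiply(1.0 / X.sum(axis=1)).tocsr()

p2 = make_pipeline(CountVectorizer(),
                   FunctionTransformer(length_normalize, accept_sparse=True))

A = p1.fit_transform(docs)
B = p2.fit_transform(docs)
```

Both pipelines produce identical matrices whose rows sum to 1.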
On Sun, Jan 28, 2018 at 08:29:58PM +1100, Joel Nothman wrote:
I can't say it's especially obvious that these features are available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can.
Same feeling here. I am afraid of the crowding effect that makes it harder and harder to find things as we add them. Gaël
Hi Folks,

Thank you all for the feedback and interesting discussion. I do realize that adding a feature comes with risks, and that there should really be compelling reasons to do so. Let me try to address your comments here, and make one final case for the value of this feature:

1) Use Normalizer, FunctionTransformer (or custom code) to normalize the CountVectorizer result: that would require an additional pass on the data. True, that's "only" O(N), but if there is a way to speed up training an ML model, that'd be an advantage.

2) TfidfVectorizer(use_idf=False, norm='l1'): yes, that would have the same effect; but note that this is not TF-IDF any more, in that TF-IDF is a two-fold normalization. If one needs TF-IDF (with normalized document counts), then two additional passes on the data (with TfidfVectorizer(use_idf=True)) would be required to get IDF normalization, bringing us to a case similar to the above.

3) "I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards." I think this is one of the most important points here. Though not a formal proof, I can for example refer to:

- NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>, which uses document-length-normalized term frequencies.
- Manning, Raghavan and Schütze's Introduction to Information Retrieval <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classificatio...>: "The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and...> 7 <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-comp...> also apply here."

On the other hand, applying this kind of normalization to a corpus where the document lengths are similar (such as tweets) will probably not be of any advantage.

4) This would be a handy feature, as Sebastian mentioned, and the code change would be very small (careful here... any code change brings risks).

What do you think?

Best regards, Yacine.
I don't think you will do this without an O(N) cost. The fact that it's done with a second pass is moot.

My position stands: if this change happens, it should be to TfidfTransformer (which should perhaps be called something like CountVectorWeighter!) alone.
Okay, thanks for the replies. @Joel: Should I go ahead and send a PR with the change to TfidfTransformer?
Hi Yacine, On 29/01/18 16:39, Yacine MAZARI wrote:
I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards. I think this is one of the most important points here. Though not a formal proof, I can for example refer to:
* NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>, which is using document-length-normalized term frequencies.
* Manning and Schütze's Introduction to Information Retrieval <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classificatio...>: "The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 7 also apply here."
I believe the conclusion of Manning's Chapter 6 is the following table of TF-IDF weighting schemes: https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighti... in which the document length normalization is applied _after_ the IDF. So "length-normalized tf-idf" is just TfidfVectorizer with norm='l1' as previously mentioned (at least, if you measure the document length as the number of words it contains).

More generally, a weighting & normalization transformer for some of the other configurations in that table is implemented in http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_we...

With respect to the NLTK implementation, see https://github.com/nltk/nltk/pull/979#issuecomment-102296527

So I don't think there is a need to change anything in TfidfTransformer...

-- Roman
A very good point! (Although augmented and log-average tf both do some kind of normalisation of the tf distribution before IDF weighting.)
Good point, Joel. I actually forgot that you can set the norm param in the TfidfVectorizer, so one could basically do

vect = TfidfVectorizer(use_idf=False, norm='l1')

to get the CountVectorizer behavior but normalized by the document length.

Best, Sebastian
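[Editor's note] This equivalence is easy to verify: with use_idf=False and norm='l1', TfidfVectorizer should reduce to raw counts divided by the (in-vocabulary) document length.

```python
# Check: TfidfVectorizer(use_idf=False, norm='l1') == counts / doc length.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the sun is shining and the weather is sweet",
        "hello world the sun is shining"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
manual = counts / counts.sum(axis=1, keepdims=True)

tf_l1 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs).toarray()
```

Both vectorizers build the same (alphabetically sorted) vocabulary, so the matrices line up column for column.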
That's equivalent to Normalizer(norm='l1'), or to a FunctionTransformer that divides each row by its L1 norm (note that FunctionTransformer(np.linalg.norm, ...) by itself would return the norms, not the normalized rows). The problem is that length norm followed by TfidfTransformer now can't do sublinear TF right... But that's alright if we know we can always do FunctionTransformer(lambda X: calc_sublinear(X) / X.sum(axis=1)), perhaps then followed by applying IDF from TfidfTransformer.

Yes, it's not straightforward, but it's very hard to provide a library that suits everyone's needs... so FunctionTransformer and Pipeline are your friends :)
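[Editor's note] A sketch of that composition, as one possible reading: sublinear TF, then division by raw document length, then IDF. `sublinear_over_length` below is our illustrative stand-in for Joel's `calc_sublinear(X) / X.sum(axis=1)`; the 1 + log(count) definition of sublinear TF is assumed.

```python
# Sublinear TF -> length normalization -> IDF, composed from existing pieces.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def sublinear_over_length(X):
    # 1 + log(count) on nonzero entries, divided by the *raw* document length.
    lengths = np.asarray(X.sum(axis=1)).ravel()
    Y = X.tocsr().astype(float)
    Y.data = 1.0 + np.log(Y.data)
    return Y.multiply((1.0 / lengths)[:, None]).tocsr()

pipe = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(sublinear_over_length, accept_sparse=True),
    TfidfTransformer(use_idf=True, norm=None),  # IDF applied *after* length norm
)

docs = ["the sun is shining and the weather is sweet",
        "hello world the sun is shining"]
X = pipe.fit_transform(docs)
```

The point of the pipeline order is exactly the open question in the thread: whether norming before IDF beats the usual norm-after-IDF.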
Hi Yacine,

Just on a side note, you can set use_idf=False in the TfidfVectorizer and only normalize the vectors by their L2 norm. But yeah, the normalization you suggest might be really handy in certain cases. I am not sure though if it's worth making this another parameter in the CountVectorizer (which already has quite a lot of parameters), as it can be computed quite easily if I am not misinterpreting something. Since the length of each document is determined by the sum of the words in each vector, one could simply normalize by the document length as follows:
from sklearn.feature_extraction.text import CountVectorizer

dataset = ['The sun is shining and the weather is sweet',
           'Hello World. The sun is shining and the weather is sweet']

vect = CountVectorizer()
vect.fit(dataset)
transf = vect.transform(dataset)
normalized_word_vectors = transf / transf.sum(axis=1)
Where it would be tricky, though, is when you remove stop words during preprocessing but want to include them in the normalization. Then, you might have to do something like this:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

dataset = ['The sun is shining and the weather is sweet',
           'Hello World. The sun is shining and the weather is sweet']

# document lengths including stop words
counts = np.array([len(s.split()) for s in dataset]).reshape(-1, 1)

vect = CountVectorizer(stop_words='english')
vect.fit(dataset)
transf = vect.transform(dataset)
normalized = transf / counts
Best, Sebastian
participants (6)
- Gael Varoquaux
- Jacob Vanderplas
- Joel Nothman
- Roman Yurchak
- Sebastian Raschka
- Yacine MAZARI