<div dir="ltr"><div><div><div><div><div><div><div><div><div>Hello,<br><br></div>I would like to work on adding a feature to "sklearn.feature_extraction.text.CountVectorizer".<br><br></div>In the current implementation, term frequency is defined as the number of times a term t occurs in a document d.<br></div><br>However, another definition that is very commonly used in practice is the <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2">term frequency adjusted for document length</a>, i.e. tf = raw count / document length.<br><br></div>I intend to implement this by adding a boolean parameter "relative_frequency" to the constructor of CountVectorizer.<br></div>If the parameter is True, X would be normalized by document length (along axis=1) in "CountVectorizer.fit_transform()".<br><br></div>What do you think?<br></div>If this sounds reasonable and worth it, I will send a PR.<br><br></div>Thank you,<br></div>Yacine.<br></div>
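To make the proposed definition concrete, here is a minimal pure-Python sketch of tf = raw count / document length. It is only an illustration under stated assumptions: the helper name `relative_term_frequency` is hypothetical, the tokenization is a plain lowercase whitespace split rather than CountVectorizer's tokenizer, and nothing here is existing or proposed scikit-learn code.

```python
from collections import Counter


def relative_term_frequency(doc):
    """Return {term: raw count / document length} for a single document.

    Hypothetical helper for illustration only. Tokens come from a plain
    lowercase whitespace split, not from CountVectorizer's tokenizer.
    """
    tokens = doc.lower().split()
    total = len(tokens)
    if total == 0:
        # An empty document has no terms; avoid dividing by zero.
        return {}
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}


docs = ["the cat sat on the mat", "the dog"]
tf = [relative_term_frequency(d) for d in docs]
```

In this sketch the values for each non-empty document sum to 1, which is the same as L1-normalizing each row of the count matrix. As far as I know, a similar result may already be obtainable in scikit-learn with `TfidfVectorizer(use_idf=False, norm="l1")`, so that could be worth comparing against the proposed parameter.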