Term frequency using scikit-learn's CountVectorizer
MRAB
python at mrabarnett.plus.com
Sun Oct 16 21:42:31 EDT 2016
On 2016-10-17 02:04, Abdul Abdul wrote:
> I have the following code snippet where I'm trying to list the term frequencies; first_text and second_text are .tex documents:
>
> from sklearn.feature_extraction.text import CountVectorizer
> training_documents = (first_text, second_text)
> vectorizer = CountVectorizer()
> vectorizer.fit_transform(training_documents)
> print "Vocabulary:", vectorizer.vocabulary_
> When I run the script, I get the following:
>
> File "test.py", line 19, in <module>
> vectorizer.fit_transform(training_documents)
> File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
> self.fixed_vocabulary_)
> File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
> for feature in analyze(doc):
> File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
> tokenize(preprocess(self.decode(doc))), stop_words)
> File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
> doc = doc.decode(self.encoding, self.decode_error)
> File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte
> How can I fix this issue?
>
I've had a quick look at the docs here:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
and I think you need to tell it what encoding the text actually uses,
via CountVectorizer's 'encoding' argument. By default it assumes the
text is UTF-8, but clearly your text uses a different encoding.
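
For example, something along these lines might work. This is only a
sketch: it assumes your .tex files are Latin-1/Windows-1252 encoded
(byte 0xa2 is valid there but not in UTF-8), and the file names are
made up, so substitute your own paths and encoding:

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical file names; read the raw bytes of each document.
    first_text = open('first.tex').read()
    second_text = open('second.tex').read()
    training_documents = (first_text, second_text)

    # Tell CountVectorizer how to decode the bytes (guessing Latin-1 here).
    vectorizer = CountVectorizer(encoding='latin-1')
    vectorizer.fit_transform(training_documents)

    # The learned vocabulary is stored with a trailing underscore.
    print "Vocabulary:", vectorizer.vocabulary_

If you don't know the encoding and can tolerate dropping or replacing
the odd character, CountVectorizer also accepts decode_error='ignore'
or decode_error='replace' instead of the default 'strict'.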