Term frequency using scikit-learn's CountVectorizer
Abdul Abdul
abdul.sw84 at gmail.com
Sun Oct 16 21:08:29 EDT 2016
Hello,
I have the following code snippet, in which I'm trying to list the term
frequencies. Here first_text and second_text are .tex documents:
from sklearn.feature_extraction.text import CountVectorizer

training_documents = (first_text, second_text)
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary
When I run the script, I get the following:
  File "test.py", line 19, in <module>
    vectorizer.fit_transform(training_documents)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte
How can I fix this issue?
Thanks.
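
For reference, the traceback shows CountVectorizer decoding each document as UTF-8 (its default encoding) and stopping at byte 0xa2, which cannot start a UTF-8 sequence. This suggests that at least one of the files is stored in some other encoding. The sketch below is only an illustration under that assumption: the helper name decode_document and the Latin-1 fallback are hypothetical choices, and the real encoding of the .tex files is not known from the post.

```python
def decode_document(raw_bytes, encodings=("utf-8", "latin-1")):
    """Decode raw_bytes with the first encoding in `encodings` that succeeds.

    The fallback list is an assumption; replace "latin-1" with the
    encoding the files were actually saved in, if it is known.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode with the first encoding and substitute
    # U+FFFD for every byte sequence that cannot be decoded.
    return raw_bytes.decode(encodings[0], "replace")


# 0xa2 is the cent sign in Latin-1 but an invalid start byte in UTF-8,
# so this sample reproduces the same kind of failure on a small input.
sample = b"price: 5\xa2"
text = decode_document(sample)
```

Documents decoded this way are already unicode strings, so they could be passed to fit_transform without a further decoding step. CountVectorizer also accepts encoding and decode_error arguments (for example decode_error="replace"), which may be enough if losing the undecodable characters is acceptable.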