[Chicago] Chardet help
Tathagata Dasgupta
tathagatadg at gmail.com
Sun Mar 10 16:00:43 CET 2013
Good morning Chipy,
Some encoding foo to spoil the Sunday morning motivational beverage!
I am trying to read a file written in Italian
(https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt), and after a
bit of trial and error I decided to use chardet to detect its encoding.
def getEncoding(infile):
    import chardet
    # Read raw bytes; chardet works on undecoded data.
    with open(infile, "rb") as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    charenc = result['encoding']
    print(charenc)
That gives me ISO-8859-2.
So I do:
def filter_non_corpus_words(in_folder, out_folder, file):
    import codecs
    # Decode with the encoding chardet reported and collect the stripped lines.
    with codecs.open(file, encoding='ISO-8859-2') as f:
        corpus_words = set(line.strip() for line in f)
    print(corpus_words)
What gets printed on the output window is
http://pastebin.com/FbhQPSb2
Efff.
I open the same file in gvim/Notepad++ - no quirky \x92 stabbing my
eyeballs.
How do I get around this?
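
One way to narrow this down is to decode the same bytes under more than one candidate encoding and compare the results. The sketch below is only an illustration: the helper name decode_candidates and the sample bytes are hypothetical (they are not taken from the file above), and cp1252 is just a guess at an alternative, not something chardet reported.

```python
from __future__ import print_function


def decode_candidates(raw, encodings=("iso-8859-2", "cp1252")):
    """Return {encoding: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes at all; skip it.
            continue
    return results


# Hypothetical sample containing the byte 0x92; NOT taken from the real file.
sample = b"l\x92uomo"
decoded = decode_candidates(sample)
for name in sorted(decoded):
    # repr() makes invisible control characters show up as escapes.
    print(name, repr(decoded[name]))
```

Under ISO-8859-2 the byte 0x92 maps to the control character U+0092, while under cp1252 it maps to U+2019, the right single quotation mark. Note also that printing a set in Python 2 shows the repr of each element, so non-ASCII characters appear as escapes even when the decoding is correct.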
--
Cheers,
T
P.S. http://nedbatchelder.com/text/unipain/unipain.htm is very entertaining!