[Chicago] Chardet help

Tathagata Dasgupta tathagatadg at gmail.com
Sun Mar 10 16:00:43 CET 2013


Good morning Chipy,
Some encoding foo to spoil the Sunday morning motivational beverage!

I am trying to read a file
(https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt) written in
Italian, and after a bit of trial and error decided to go with chardet.


def getEncoding(infile):
    import chardet
    # Read raw bytes; chardet expects undecoded data
    with open(infile, "rb") as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    charenc = result['encoding']
    print(charenc)
    return charenc

That gives me ISO-8859-2.

So I do:

def filter_non_corpus_words(in_folder, out_folder, file):
    import codecs
    # Decode with the encoding chardet reported and strip each line
    with codecs.open(file, encoding='ISO-8859-2') as f:
        corpus_words = set(line.strip() for line in f)
    print(corpus_words)

What gets printed on the output window is
http://pastebin.com/FbhQPSb2

Efff.

I open the same file in gvim/Notepad++ - no quirky \x92 stabbing my
eyeballs.
How do I get around this?
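
One possible explanation, sketched below under an assumption I have not verified: the file may really be Windows-1252 (cp1252) rather than ISO-8859-2, so chardet's answer would be a near miss. In cp1252 the byte 0x92 is the curly apostrophe, while ISO-8859-2 decodes it to an invisible control character that shows up as \x92 when printed. The sample bytes and the helper name decode_with are made up for illustration; they are not taken from the actual file.

```python
# Hypothetical sample: the bytes cp1252 would use for "l'acqua" written
# with a curly apostrophe. Not taken from the real corpus file.
raw = b"l\x92acqua"


def decode_with(raw_bytes, encoding):
    """Decode raw_bytes with the named codec and return the text."""
    return raw_bytes.decode(encoding)


# ISO-8859-2 maps byte 0x92 to the C1 control character U+0092.
as_latin2 = decode_with(raw, "iso-8859-2")

# cp1252 maps the same byte to U+2019, the right single quotation mark.
as_cp1252 = decode_with(raw, "cp1252")

print(repr(as_latin2))
print(repr(as_cp1252))
```

If that guess holds, passing encoding='cp1252' to codecs.open might be worth a try; if the output still looks wrong, the file is probably in some other encoding.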

-- 
Cheers,
T

P.S. http://nedbatchelder.com/text/unipain/unipain.htm is very entertaining!