<div dir="ltr"><div style>Good morning Chipy,</div><div style>Some encoding foo to spoil the Sunday morning motivational beverage!</div><div style><br></div>I am trying to read a file (<a href="https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt">https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt</a>) written in Italian and, after a bit of trial and error, decided to detect its encoding with chardet.<div>
<br></div><div><br></div><div><div>def getEncoding(infile):</div><div><span class="" style="white-space:pre"> </span>import chardet</div><div><span class="" style="white-space:pre"> </span># chardet works on raw bytes, so read the file in binary mode</div><div><span class="" style="white-space:pre"> </span>rawdata = open(infile, "rb").read()</div>
<div><span class="" style="white-space:pre"> </span>result = chardet.detect(rawdata)</div><div><span class="" style="white-space:pre"> </span>charenc = result['encoding']</div><div><span class="" style="white-space:pre"> </span>print charenc</div><div><span class="" style="white-space:pre"> </span>return charenc</div>
<div><br></div><div style>That gives me ISO-8859-2. </div><div style><br></div><div style>So I do:</div><div style><br></div><div style><div>def filter_non_corpus_words(in_folder,out_folder,file):</div><div><span class="" style="white-space:pre"> </span>import codecs</div>
<div><span class="" style="white-space:pre"> </span>corpus_words = set(line.strip() for line in codecs.open(file, encoding='ISO-8859-2'))</div><div><span class="" style="white-space:pre"> </span>print corpus_words</div>
<div><br></div><div style>What gets printed in the output window is:</div></div><div style><a href="http://pastebin.com/FbhQPSb2">http://pastebin.com/FbhQPSb2</a><br></div><div style><br></div><div style>Efff. </div><div style>
<br></div><div style>I open the same file in gvim/Notepad++ - no quirky \x92 stabbing my eyeballs.</div><div style>How do I get around this?</div><div><br></div>-- <br>Cheers,<br>T
</div><div><br></div><div style>P.S. <a href="http://nedbatchelder.com/text/unipain/unipain.htm">http://nedbatchelder.com/text/unipain/unipain.htm</a> is very entertaining!</div></div>
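<div><br></div><div>One possible explanation for the \x92, as a minimal sketch rather than a confirmed diagnosis. It assumes the file is really Windows-1252 and not ISO-8859-2 (chardet's answer is only a statistical guess). In ISO-8859-2 the byte 0x92 decodes to the invisible control character U+0092, while in Windows-1252 it is a curly apostrophe (U+2019). Printing a whole set also shows the repr of each element, so unprintable characters come out as escapes. The sample bytes and the helper names decode_candidates and read_words are hypothetical, not taken from the code above.</div>

```python
from __future__ import print_function

import io


def decode_candidates(raw, encodings=("iso-8859-2", "cp1252")):
    """Decode the same bytes under each candidate encoding for comparison."""
    return {enc: raw.decode(enc) for enc in encodings}


def read_words(path, encoding):
    """Return the set of stripped lines from a text file in the given encoding."""
    with io.open(path, encoding=encoding) as handle:
        return set(line.strip() for line in handle)


# Hypothetical sample: "dell" followed by the byte 0x92, as it might sit in the file.
sample = b"dell\x92"
decoded = decode_candidates(sample)

# Show the code point that the trailing byte 0x92 turns into under each encoding.
# ISO-8859-2 yields the control character U+0092; Windows-1252 yields U+2019.
for enc in sorted(decoded):
    last_char = decoded[enc][-1]
    print(enc, "U+%04X" % ord(last_char))
```

<div>If the assumption holds, reading the file with something like read_words(path, "cp1252") and printing the words one at a time, rather than printing the whole set, should make the escapes go away. If it does not hold, comparing a few candidate encodings in the same way may still help narrow it down.</div>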