<div dir="ltr">Nice ... from the wiki page:<div><br></div><div>"This encoding is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range."<div>
<br><div><div style>The file utility is confused too ...</div><div style><br></div><div># file -bi uniq_words_in_corpus.txt</div><div>text/plain; charset=unknown-8bit</div></div></div></div><div><br></div><div style>I tried all the other iso-8859-* encodings, but ultimately Windows-1252 worked out ...</div>
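<div>A small sketch of what that 80 to 9F difference means in practice. The sample bytes and the helper name has_c1_controls are made up for illustration, and this assumes Python 3; it is not taken from the corpus file itself.</div>

```python
# Hypothetical sample: bytes 0x93, 0x94 and 0x80 all sit in the 80-9F range.
raw = b"\x93ciao\x94 \x80"

as_cp1252 = raw.decode("cp1252")      # curly quotes and a euro sign
as_latin1 = raw.decode("iso-8859-1")  # same bytes become C1 control characters


def has_c1_controls(text):
    """Return True if text contains any code point in U+0080..U+009F."""
    return any(ord(ch) in range(0x80, 0xA0) for ch in text)


print(has_c1_controls(as_cp1252))  # displayable characters only
print(has_c1_controls(as_latin1))  # control characters present
```

<div>Both decodes succeed without an exception, which is probably why trial and error over the iso-8859-* family is hard to judge by eye: the check above only flags that the ISO-8859-1 result contains control characters where Windows-1252 yields printable ones.</div>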
<div><br></div><div><div>import codecs</div><div>corpus_words = set(map(lambda s: s.strip(), codecs.open(file, encoding='Windows-1252').readlines()))<br></div><div>for i in sorted(corpus_words):</div><div><span class="" style="white-space:pre"> </span>print i.encode("Windows-1252")</div>
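<div>For anyone on Python 3, roughly the same read could be sketched as below. The function name read_unique_words, the temporary file, and the two sample words are assumptions standing in for the real uniq_words_in_corpus.txt.</div>

```python
import io
import os
import tempfile


def read_unique_words(path, encoding="cp1252"):
    """Return the set of stripped lines decoded from path."""
    with io.open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle}


# Hypothetical sample data, written as Windows-1252 bytes.
sample = u"perch\u00e9\ncitt\u00e0\nperch\u00e9\n".encode("cp1252")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

words = read_unique_words(tmp.name)
os.remove(tmp.name)

unique_sorted = sorted(words)
```

<div>In Python 3, print encodes text using the encoding of standard output, so the explicit i.encode(...) step from the Python 2 loop would not normally be needed there.</div>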
</div><div><br></div><div style>the print i.encode("..") got it working ... </div><div style><br></div><div style><br></div><div style>Martin, Clyde, Jeff - Thanks guys for the help :D !</div><div><br></div><div style>
Have a great week ahead Chipy ...</div><div style><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Mar 10, 2013 at 12:49 PM, Jeff Hinrichs - DM&T <span dir="ltr"><<a href="mailto:jeffh@dundeemt.com" target="_blank">jeffh@dundeemt.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">If it does in fact look like ISO 8859-1, then you should be using cp1252; that is what HTML5 does. See <a href="http://en.wikipedia.org/wiki/Windows-1252" target="_blank">http://en.wikipedia.org/wiki/Windows-1252</a> <div>
<br></div><div>Best,</div><div>Jeff</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div class="h5">On Sun, Mar 10, 2013 at 12:38 PM, Clyde Forrester <span dir="ltr"><<a href="mailto:clydeforrester@gmail.com" target="_blank">clydeforrester@gmail.com</a>></span> wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
<div bgcolor="#FFFFFF" text="#000000">
<div>According to Firefox, the encoding is
windows-1252.<br>
<br>
On 3/10/2013 10:00 AM, Tathagata Dasgupta wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Good morning Chipy,</div>
<div>Some encoding foo to spoil the Sunday morning
motivational beverage!</div>
<div><br>
</div>
I am trying to read a file (<a href="https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt" target="_blank">https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt</a>)
written in Italian - and after a bit of trial and error
decided to go with chardet.
<div>
<br>
</div>
<div><br>
</div>
<div>
<div>def getEncoding(infile):</div>
<div><span style="white-space:pre-wrap"> </span>import
chardet </div>
<div><span style="white-space:pre-wrap"> </span>rawdata =
open(infile, "rb").read()</div>
<div><span style="white-space:pre-wrap"> </span>result =
chardet.detect(rawdata)</div>
<div><span style="white-space:pre-wrap"> </span>charenc =
result['encoding']</div>
<div><span style="white-space:pre-wrap"> </span>print
charenc</div>
<div><br>
</div>
<div>That gives me ISO-8859-2. </div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
<br></div></div><div class="im">_______________________________________________<br>
Chicago mailing list<br>
<a href="mailto:Chicago@python.org" target="_blank">Chicago@python.org</a><br>
<a href="http://mail.python.org/mailman/listinfo/chicago" target="_blank">http://mail.python.org/mailman/listinfo/chicago</a><br>
<br></div></blockquote></div><span class="HOEnZb"><font color="#888888"><br><br clear="all"><div><br></div>-- <br>Best,<div><br></div><div>Jeff Hinrichs<br><a href="tel:402.218.1473" value="+14022181473" target="_blank">402.218.1473</a><br>
<br></div>
</font></span></div>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Cheers,<br>T
</div>