[Chicago] Chardet help

Tathagata Dasgupta tathagatadg at gmail.com
Sun Mar 10 20:02:34 CET 2013


Nice ... from the Wikipedia page:

"This encoding is a superset of ISO 8859-1, but differs from the IANA's
ISO-8859-1 by using displayable characters rather than control characters
in the 80 to 9F (hex) range."
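
A small sketch of what that quote means in practice. It is written for Python 3 (the snippets elsewhere in this thread are Python 2), and the sample bytes are made up for illustration, not taken from the real file. Bytes in the 80 to 9F range decode to printable characters under windows-1252 but to control characters under ISO-8859-1, while the rest of the high range agrees.

```python
# Illustrative bytes only: 0x93 and 0x94 are curly quotes in windows-1252,
# and 0xE8 is an e-grave in both encodings.
sample = b"\x93perch\xe8\x94"

as_cp1252 = sample.decode("windows-1252")
as_latin1 = sample.decode("iso-8859-1")

# ascii() keeps the output printable on any terminal
print(ascii(as_cp1252))
print(ascii(as_latin1))

# Outside the 80-9F range the two encodings agree
assert b"\xe8".decode("windows-1252") == b"\xe8".decode("iso-8859-1")
```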

The file utility is confused too ...

# file -bi uniq_words_in_corpus.txt
text/plain; charset=unknown-8bit
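
One rough way to see why the guess is hard is to count which high bytes actually occur. The sketch below is only an illustration: high_byte_counts and the sample bytes are hypothetical, not from the real file. Bytes in the 80 to 9F range hint at Windows-1252 rather than a true ISO-8859-* encoding, since the ISO sets do not put printable characters there.

```python
from collections import Counter

def high_byte_counts(data):
    """Count bytes >= 0x80, split into the 80-9F range and the A0-FF range."""
    counts = Counter()
    for b in bytearray(data):  # bytearray yields ints on Python 2 and 3
        if 0x80 <= b <= 0x9F:
            counts["80-9F"] += 1
        elif b >= 0xA0:
            counts["A0-FF"] += 1
    return counts

# Hypothetical sample standing in for the real file contents
sample = b"citt\xe0\nperch\xe8\n\x93ciao\x94\n"
print(high_byte_counts(sample))
```

For a real file, the bytes would come from open(path, "rb").read().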

Tried all the other iso-8859-* encodings, but ultimately Windows-1252 worked out ...

import codecs

# file is the path to uniq_words_in_corpus.txt
corpus_words = set(s.strip() for s in
                   codecs.open(file, encoding='windows-1252'))
for i in sorted(corpus_words):
    print i.encode('windows-1252')

The explicit i.encode("..") in the print statement is what got it working ...
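
The explicit encode presumably matters because Python 2 otherwise falls back to the ASCII codec when printing unicode to a pipe or a terminal it cannot identify. For comparison, here is a minimal Python 3 sketch of the same decode, dedupe, sort, encode round trip. The unique_words helper and the sample bytes are made up for illustration, and it encodes to UTF-8 on the way out, which is an assumption about what the output should be.

```python
import io
import os
import tempfile

def unique_words(path, encoding="windows-1252"):
    """Return the sorted unique stripped lines decoded from path."""
    with io.open(path, encoding=encoding) as fh:
        return sorted({line.strip() for line in fh})

# Made-up stand-in for uniq_words_in_corpus.txt
raw = b"perch\xe8\ncitt\xe0\nperch\xe8\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)

words = unique_words(tmp.name)
os.remove(tmp.name)

# Encode explicitly on output, as in the Python 2 version above
encoded = [w.encode("utf-8") for w in words]
print(encoded)
```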


Martin, Clyde, Jeff - thanks guys for the help :D !

Have a great week ahead Chipy ...



On Sun, Mar 10, 2013 at 12:49 PM, Jeff Hinrichs - DM&T
<jeffh at dundeemt.com> wrote:

> If it does in fact look like ISO 8859-1, then you should be using cp1252;
>  that is what HTML5 does. See http://en.wikipedia.org/wiki/Windows-1252
>
>
> Best,
> Jeff
>
>
>
> On Sun, Mar 10, 2013 at 12:38 PM, Clyde Forrester <
> clydeforrester at gmail.com> wrote:
>
>>  According to Firefox, the encoding is windows-1252.
>>
>> On 3/10/2013 10:00 AM, Tathagata Dasgupta wrote:
>>
>>  Good morning Chipy,
>> Some encoding foo to spoil the Sunday morning motivational beverage!
>>
>>  I am trying to read a file (
>> https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt)   written in
>> Italian - and after a bit of trial and error decided to go with chardet.
>>
>>
>>  def getEncoding(infile):
>>      import chardet
>>      # read raw bytes; chardet works on undecoded data
>>      rawdata = open(infile, "rb").read()
>>      result = chardet.detect(rawdata)
>>      charenc = result['encoding']
>>      print charenc
>>
>>  That gives me ISO-8859-2.
>>
>>
>>
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> http://mail.python.org/mailman/listinfo/chicago
>>
>>
>
>
> --
> Best,
>
> Jeff Hinrichs
> 402.218.1473
>
>
>
>


-- 
Cheers,
T