[Chicago] Chardet help
Tathagata Dasgupta
tathagatadg at gmail.com
Sun Mar 10 20:02:34 CET 2013
Nice ... from the wiki page:
"This encoding is a superset of ISO 8859-1, but differs from the IANA's
ISO-8859-1 by using displayable characters rather than control characters
in the 80 to 9F (hex) range."
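
A quick sketch of what that quote means in practice, assuming Python 3. The sample bytes are made up for illustration and are not taken from the corpus file:

```python
# Bytes in the 0x80-0x9F range: control characters in ISO-8859-1,
# printable characters (curly quotes, ellipsis, euro sign) in Windows-1252.
raw = b"\x93ciao\x94 \x85 \x80"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("windows-1252")

# ascii() escapes non-ASCII characters so the output is safe on any console.
print(ascii(as_latin1))
print(ascii(as_cp1252))

# Outside that range the two encodings agree, e.g. on Italian accented vowels.
accented = b"\xe0\xe8\xe9\xec\xf2\xf9"
same_outside_range = accented.decode("iso-8859-1") == accented.decode("windows-1252")
print(same_outside_range)
```

So a file that only uses accented letters decodes identically either way. The choice matters only once bytes from 80 to 9F show up.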
The file utility is confused too ...
# file -bi uniq_words_in_corpus.txt
text/plain; charset=unknown-8bit
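
One way to check by hand whether a file contains bytes from that 80 to 9F range is to list them. This is only a sketch: the helper name c1_range_bytes and the sample text are hypothetical, and it is a guess, not something confirmed here, that such bytes are what confuses file.

```python
def c1_range_bytes(data):
    """Return the sorted distinct byte values in data that fall in 0x80-0x9F."""
    return sorted({b for b in bytearray(data) if 0x80 <= b <= 0x9F})

# Made-up sample: Italian words plus curly quotes, saved as Windows-1252.
sample = u"perch\u00e9 \u201ccitt\u00e0\u201d".encode("windows-1252")

found = c1_range_bytes(sample)
print([hex(b) for b in found])
```

For the real file, pass in open(path, "rb").read(). An empty result would mean ISO-8859-1 and Windows-1252 decode the file identically.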
I tried all the other iso-8859-* encodings, but ultimately Windows-1252 worked out ...
import codecs
corpus_words = set(map(lambda s: s.strip(),
                       codecs.open(file, encoding='windows-1252').readlines()))
for i in sorted(corpus_words):
    print i.encode("windows-1252")
Adding the print i.encode("..") is what got it working ...
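
For anyone on Python 3, a rough equivalent might look like the sketch below. The load_words helper and the temporary stand-in file are assumptions made so the example is self-contained. They are not part of the original script.

```python
import io
import os
import tempfile

def load_words(path, encoding="windows-1252"):
    """Return the set of stripped, non-empty lines of path, decoded as text."""
    with io.open(path, encoding=encoding) as handle:
        return set(line.strip() for line in handle if line.strip())

# Stand-in for the real corpus file: a few Italian words saved as Windows-1252.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"citt\u00e0\nperch\u00e9\ncitt\u00e0\n".encode("windows-1252"))

words = load_words(path)
os.remove(path)

for word in sorted(words):
    # ascii() avoids encode errors on consoles that cannot show accented letters.
    print(ascii(word))
```

Decoding once at read time keeps everything as text inside the program, so the only place an encoding has to be chosen again is on output.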
Martin, Clyde, Jeff - Thanks guys for the help :D !
Have a great week ahead Chipy ...
On Sun, Mar 10, 2013 at 12:49 PM, Jeff Hinrichs - DM&T
<jeffh at dundeemt.com> wrote:
> If it is in fact looking like 8859-1 then you should be using cp-1252,
> that is what HTML5 does. see http://en.wikipedia.org/wiki/Windows-1252
>
>
> Best,
> Jeff
>
>
>
> On Sun, Mar 10, 2013 at 12:38 PM, Clyde Forrester <
> clydeforrester at gmail.com> wrote:
>
>> According to Firefox, the encoding is windows-1252.
>>
>> On 3/10/2013 10:00 AM, Tathagata Dasgupta wrote:
>>
>> Good morning Chipy,
>> Some encoding foo to spoil the Sunday morning motivational beverage!
>>
>> I am trying to read a file (
>> https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt) written in
>> Italian - and after a bit of trial and error decided to go with chardet.
>>
>>
>> def getEncoding(infile):
>>     import chardet
>>     # read raw bytes; chardet works on undecoded data
>>     rawdata = open(infile, "rb").read()
>>     result = chardet.detect(rawdata)
>>     charenc = result['encoding']
>>     print charenc
>>
>> That gives me ISO-8859-2.
>>
>>
>>
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> http://mail.python.org/mailman/listinfo/chicago
>>
>>
>
>
> --
> Best,
>
> Jeff Hinrichs
> 402.218.1473
--
Cheers,
T