Try this
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Sep 16 22:27:25 EDT 2007
On Sun, 16 Sep 2007 21:58:09 -0300, mensanator at aol.com
<mensanator at aol.com> wrote:
>> I'm eagerly awaiting publication of your professional specification
>> for correctly detecting the encoding of an arbitrary stream of
>> bytes
>
> The very presence of an algorithm to detect encoding is a bug.
> Files with the .txt extension should always be treated as ANSI
> even if they contain binary data.
Why ANSI? Because it's convenient to *you*? What about the rest of the
world that doesn't speak English or, even worse, doesn't use the Latin
alphabet?
What do you mean by "binary data"? Notepad is not interpreting the file as
"binary", it's text, but interpreted using the wrong encoding.
If you want to understand what happens here: The Unicode block for 'CJK
Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
basic plane, with more than 20000 code points. The block just before it
contains the famous 64 hexagrams, and the block before that one is 'CJK
Unified Han Extension A', ranging from U+3400 to U+4DBF.
Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
0x4100-0x7AFF is totally contained inside the above Unicode blocks.
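That containment claim can be checked with a minimal Python sketch. The block boundaries are the ones quoted above; the names in the table are the official Unicode block names, which differ slightly from the informal ones used in this message, and `block_of` is only an illustrative helper, not part of any library.

```python
# Block ranges as given in the message (inclusive on both ends).
BLOCKS = {
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "Yijing Hexagram Symbols": (0x4DC0, 0x4DFF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}

def block_of(code_point):
    """Return the name of the block containing code_point, or None."""
    for name, (lo, hi) in BLOCKS.items():
        if lo <= code_point <= hi:
            return name
    return None

# Every value from 0x4100 to 0x7AFF should fall in one of the three blocks.
uncovered = [cp for cp in range(0x4100, 0x7B00) if block_of(cp) is None]
print(len(uncovered))  # 0
```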
Reading a small phrase containing only ASCII letters as if it were UTF16
would collapse every two letters into a single character, each character
being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
positions only, else the character would not belong to the Han blocks).
As every character goes into the same code block, the heuristic concludes
that the text is some Eastern language encoded in UTF16.
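To make the odd/even position remark concrete, here is a rough Python sketch of a check of that kind. It is a hypothetical illustration only, not Notepad's actual detection code: `looks_like_han_utf16` is an invented name, and it assumes little-endian byte order, so the first byte of each pair is the low byte.

```python
def looks_like_han_utf16(data: bytes) -> bool:
    """Hypothetical check: does every little-endian 16-bit unit fall in
    the contiguous range U+3400..U+9FFF (the three blocks named above)?"""
    if len(data) == 0 or len(data) % 2 != 0:
        return False
    # Pair up bytes: first byte is the low half, second byte the high half.
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    return all(0x3400 <= unit <= 0x9FFF for unit in units)

# Space in an odd position (3rd byte): it becomes a low byte, still "Han".
print(looks_like_han_utf16(b"Hi there"))  # True
# Space in an even position (2nd byte): it becomes the high byte (0x20xx).
print(looks_like_han_utf16(b"I am ok!"))  # False
```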
This is the "Well you are speed" phrase interpreted as UTF16:
u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
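The same result can be reproduced in a few lines of Python 3 (a small sketch; the variable names are arbitrary, and little-endian byte order is assumed, as used on Windows):

```python
phrase = "Well you are speed"

# The bytes as they sit in an ANSI/ASCII text file.
data = phrase.encode("ascii")

# Misinterpret the same bytes as little-endian UTF-16.
misread = data.decode("utf-16-le")

print(len(phrase), len(misread))  # 18 9
print(" ".join("U+%04X" % ord(ch) for ch in misread))
# U+6557 U+6C6C U+7920 U+756F U+6120 U+6572 U+7320 U+6570 U+6465
```

All nine resulting code points fall between U+4E00 and U+9FFF.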
> Notepad should never be
> allowed to try to decide what the encoding is if the open
> dialog has the encoding set to ANSI.
I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
that's exactly what happens. I have to explicitly select Unicode in order
to see those Han characters.
--
Gabriel Genellina