Try this

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Sun Sep 16 22:27:25 EDT 2007


On Sun, 16 Sep 2007 21:58:09 -0300, mensanator at aol.com
<mensanator at aol.com> wrote:

>> I'm eagerly awaiting publication of your professional specification
>> for correctly detecting the encoding of an arbitrary stream of
>> bytes
>
> The very presence of an algorithm to detect encoding is a bug.
> Files with the .txt extension should always be treated as ANSI
> even if they contain binary data.

Why ANSI? Because it's convenient for *you*? What about the rest of the
world, which doesn't speak English or, even worse, doesn't use the Latin
alphabet?
What do you mean by "binary data"? Notepad is not interpreting the file as
"binary"; it treats it as text, but decodes it using the wrong encoding.

If you want to understand what happens here: the Unicode block 'CJK
Unified Ideographs' goes from U+4E00 to U+9FFF and is the largest block in
the Basic Multilingual Plane, with more than 20000 code points. The block
just before it (U+4DC0 to U+4DFF) contains the famous 64 hexagrams, and
the one before that is 'CJK Unified Ideographs Extension A', ranging from
U+3400 to U+4DBF.
Note that ASCII letters go from 0x41 to 0x5A and from 0x61 to 0x7A, and
the range 0x4100-0x7AFF is totally contained inside the above Unicode
blocks.
Reading a short phrase containing only ASCII letters as if it were UTF-16
(little-endian, as on Windows) collapses each pair of letters into a
single character whose high byte is the second letter, so each character
lands in one of those blocks; when the second letter is lowercase, that
block is 'CJK Unified Ideographs'. (Space and punctuation are allowed in
odd positions only, counting from 1; otherwise the character would not
belong to the Han blocks.)
Since every character falls into the same group of blocks, the heuristic
concludes that the text is in some Eastern language encoded in UTF-16.
This is the "Well you are speed" phrase interpreted as UTF-16:
u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
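
If you want to reproduce this yourself, here is a minimal sketch. It
assumes Python 3 and little-endian UTF-16 without a BOM; the names
misread_as_utf16 and block_name are just illustrative helpers of mine,
and the block ranges are the ones quoted above. It says nothing about how
Notepad's heuristic is actually implemented.

```python
def misread_as_utf16(text):
    """Encode ASCII text to bytes, then decode those bytes as UTF-16-LE.

    Assumes the text has an even number of characters; an odd length
    leaves a dangling byte and decode() raises UnicodeDecodeError.
    """
    return text.encode("ascii").decode("utf-16-le")


def block_name(ch):
    """Name the Unicode block of ch, for the three blocks discussed above."""
    cp = ord(ch)
    if 0x3400 <= cp <= 0x4DBF:
        return "CJK Unified Ideographs Extension A"
    if 0x4DC0 <= cp <= 0x4DFF:
        return "Yijing Hexagram Symbols"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideographs"
    return "other"


misread = misread_as_utf16("Well you are speed")  # 18 bytes -> 9 characters
for ch in misread:
    print(f"U+{ord(ch):04X}  {block_name(ch)}")

# Counterexample: a space in an even position becomes the high byte,
# so the first character (U+2049) falls outside the Han blocks.
broken = misread_as_utf16("I am")
print(f"U+{ord(broken[0]):04X}  {block_name(broken[0])}")
```

Every character of the first phrase is reported as 'CJK Unified
Ideographs', while the first character of "I am" is not, which is the
odd-position rule for spaces mentioned above.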

> Notepad should never be
> allowed to try to decide what the encoding is if the open
> dialog has the encoding set to ANSI.

I'm using notepad.exe version 5.1.2600.2180 (XP SP2, fully updated) and
that's exactly what happens. I have to explicitly select Unicode in order
to see those Han characters.

-- 
Gabriel Genellina



