[Python-3000] Help on text editors

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Fri Sep 8 04:46:40 CEST 2006


Michael Urman wrote:
> On 9/7/06, Paul Prescod <paul at prescod.net> wrote:
> 
>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>What does "ANSI" map to in European and Asian versions of Windows?
> 
> On most Western European configurations, the ANSI Code Page is
> historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
> be something different now for supporting the EURO symbol.

None of the Windows-125x code page numbers changed when '€' was added. These
are "open" encodings in the Unicode and ISO terminology; i.e. there is an
authority (Microsoft) who can assign any previously unassigned code point at
any time.
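
As a small illustration (a sketch using Python's built-in "cp1252" and
"latin-1" codecs; the helper name euro_byte is made up for this example),
the euro sign occupies a slot in code page 1252 that used to be unassigned,
while the code page number itself stayed the same:

```python
def euro_byte(codec_name):
    """Return the bytes that encode the euro sign in codec_name, or None
    if the codec has no euro sign (hypothetical helper for illustration)."""
    try:
        return "\u20ac".encode(codec_name)
    except UnicodeEncodeError:
        return None

# In code page 1252 the euro sign is the single byte 0x80.
cp1252_euro = euro_byte("cp1252")

# ISO 8859-1 (Latin-1) is a closed encoding and has no euro sign at all.
latin1_euro = euro_byte("latin-1")
```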

> Japanese machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
> close enough).

Not close enough, actually. Cp932 is a superset of US-ASCII, whereas Shift-JIS
isn't: 0x5C represents '\' and '¥' respectively. If you think about how
important '\' is as an escaping metacharacter, this is quite a big deal
(there are other differences, but they are less important). Actual practice
in Japan is that 0x5C *can* be used as an escaping metacharacter with the
semantics of '\' (even if it is sometimes displayed as '¥'), and so Cp932 is
the encoding that should be used, even on non-Microsoft OSes.
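
A minimal sketch of what that means in practice, assuming Python's built-in
"cp932" codec (the function name decode_japanese is hypothetical): decoding
with Cp932 keeps byte 0x5C as '\', so escape sequences and path separators
survive a round trip.

```python
def decode_japanese(raw):
    """Decode bytes from a Japanese Windows-style source as Cp932.

    Hypothetical helper: it only fixes the codec choice argued for above.
    """
    return raw.decode("cp932")

# Byte 0x5C decodes to a backslash, exactly as in US-ASCII.
path = decode_japanese(b"C:\x5ctemp")

# The mapping is symmetric: a backslash encodes back to the byte 0x5C.
encoded = "\\".encode("cp932")
```

How a given font then draws U+005C (as a backslash or as a yen sign) is a
display matter and does not change the decoded text.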

> I expect notepad will default to the ACP encoding whenever a file is
> detected as such, or a new file contains only characters representable
> via that code page. Otherwise I expect it will default to "Unicode"
> (UTF-16 / UCS-2). When editing an existing file, it will default to
> the detected encoding, unless "Unicode" is required to save the
> changes. It uses BOMs to mark all unicode encodings, but doesn't
> require them to be present in order to detect "Unicode."
> http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

Yes. However, detecting Unicode without a BOM is not a good idea, for
precisely the reason described on that page (false detection of Unicode),
and so any Unicode detection algorithm in Python should be based only on
detecting a BOM, IMHO.
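
A rough sketch of BOM-only detection, using the BOM constants from the
standard codecs module. The names detect_bom and _BOMS are made up for this
example, and what to do when no BOM is found (here, return a caller-supplied
default) is an assumption, not something settled in this thread.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data, default=None):
    """Return (encoding_name, bom_length) based only on a leading BOM.

    No guessing from content is done. If no BOM is present, the result is
    (default, 0). A UTF-16-LE file whose first character after the BOM is
    U+0000 would be reported as UTF-32-LE; that ambiguity is inherent in
    the BOM byte sequences themselves.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0
```

A caller would then decode data[bom_length:] with the returned encoding, or
fall back to whatever default applies (for example the ANSI code page) when
no BOM is found.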

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>





