[Python-3000] locale-aware strings ?
paul at prescod.net
Wed Sep 6 04:00:06 CEST 2006
On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> > [...]
> > That would not be doing what the user wants. We have extensive
> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> > There should definitely be a way to force ASCII as the default
> > encoding (if only as a debugging aid), both in the program code and in
> > the environment; but it shouldn't be the only default. There should
> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> > CP436 is the default encoding set by the OS I don't see why Python
> > shouldn't use that as the default *in the absence of any other
> > preferences*.
> Cp436 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong. If Brian is using an English-language variant of
> Windows XP and has not changed the defaults, the system ("ANSI")
> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
> if C1 control characters are not used).
"There are at least two code pages in use on most PCs. Applications using
the Windows graphical user interface use the Windows code pages. These code
pages are compatible with ISO character sets, and also with ANSI character
sets. They are often referred to as *ANSI code pages*.
Character-mode applications (those using the console or command prompt
window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that were
used in DOS. These are called *OEM code pages* (Original Equipment
Manufacturer) for historical reasons.
Consider the following situation:
A PC is running a Windows operating system with ANSI code page 1252.
The code page for character-mode applications is OEM code page 437.
Text is held in a database created using the collation UTF8.
An upper case A grave in the database is stored as hex bytes C380. In a
Windows application, the same character is represented as hex C0. In a DOS
application, it is represented as hex B7."
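For anyone who wants to check those byte values, here is a minimal Python 3 sketch using the standard codecs. The helper name hex_bytes is made up for illustration. One caveat, as far as I can tell from Python's codec tables: B7 is A grave in OEM code page 850, while code page 437 has no A grave at all (B7 there is a box-drawing character), so the sketch assumes cp850 for the DOS case.

```python
# Illustrative sketch: one character, three byte representations.
A_GRAVE = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE


def hex_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as an uppercase hex string."""
    return text.encode(encoding).hex().upper()


# cp850 is an assumption here; cp437 cannot encode this character.
for name in ("utf-8", "cp1252", "cp850"):
    print(name, hex_bytes(A_GRAVE, name))

# Going the other way, the single byte B7 means different things
# depending on which code page is used to interpret it.
for name in ("cp1252", "cp437", "cp850"):
    print(name, repr(b"\xb7".decode(name)))
```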
Now notice that when we introduce Unicode (and all Python 3K strings are
Unicode), we aren't talking about DISPLAY of characters. We're talking about
INTERPRETATION of characters. So if I read a file and then merge it with
some XML data, an application that relies on the Windows default encoding
will produce different output when run as a Python script from the command
line than when run from the Windows desktop. Same app. Same data. Different
default encodings.
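A rough sketch of that effect, assuming the file was written by an ANSI (cp1252) application and that the "default encoding" is simply whichever code page gets passed in. The function merge_into_xml and the sample bytes are hypothetical, not taken from any real application.

```python
# Illustrative sketch: same bytes, same code, different default encoding.
RAW = b"caf\xe9"  # the bytes a cp1252 (ANSI) application writes for "cafe" with e acute


def merge_into_xml(data, default_encoding):
    """Decode `data` with the given default, wrap it in XML, emit UTF-8."""
    text = data.decode(default_encoding)
    return ("<name>%s</name>" % text).encode("utf-8")


# "Desktop" run, assuming the ANSI code page is the default.
from_desktop = merge_into_xml(RAW, "cp1252")

# "Console" run, assuming the OEM code page is the default.
from_console = merge_into_xml(RAW, "cp437")

print(from_desktop)
print(from_console)
print(from_desktop == from_console)
```

Neither run raises an error; the console run just silently records a different character, which is what makes this kind of mismatch hard to notice.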
Of course we could arbitrarily choose one of these two encodings as the
"true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
something about how likely either one is to be correct for a particular