[Python-3000] locale-aware strings ?

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Wed Sep 6 04:52:28 CEST 2006


Paul Prescod wrote:
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
>> > [...]
>> >
>> > That would not be doing what the user wants. We have extensive
>> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
>> > There should definitely be a way to force ASCII as the default
>> > encoding (if only as a debugging aid), both in the program code and in
>> > the environment; but it shouldn't be the only default. There should
>> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
>> > CP436 is the default encoding set by the OS I don't see why Python
>> > shouldn't use that as the default *in the absence of any other
>> > preferences*.
>>
>> Cp436 is almost certainly *not* the encoding set by the OS; Python
>> has got it wrong. If Brian is using an English-language variant of
>> Windows XP and has not changed the defaults, the system ("ANSI")
>> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
>> if C1 control characters are not used).
> 
> http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm
> 
> "There are at least two code pages in use on most PCs. Applications using
> the Windows graphical user interface use the Windows code pages. These code
> pages are compatible with ISO character sets, and also with ANSI character
> sets. They are often referred to as *ANSI code pages*.
> 
> Character-mode applications (those using the console or command prompt
> window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that were
> used in DOS. These are called *OEM code pages* (Original Equipment
> Manufacturer) for historical reasons."

True, I oversimplified.

In practice, each text file on a Windows system is somewhat more likely to be
encoded in the ANSI charset than in the OEM charset (unless the user still
commonly uses DOS-era applications). The OEM charset only exists at all as a
compatibility hack.
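
To make the ANSI/OEM mismatch concrete, here is a minimal Python sketch. It
assumes the ANSI code page is cp1252 and the OEM code page is cp437, which is
typical for a US English system but is an assumption, not something taken from
this thread; the sample string and variable names are illustrative only.

```python
# Assumed code pages for a US English Windows install (illustrative only).
ANSI = "cp1252"
OEM = "cp437"

text = "caf\u00e9"  # "cafe" with an acute accent on the final letter

ansi_bytes = text.encode(ANSI)  # bytes a GUI application would typically write
oem_bytes = text.encode(OEM)    # bytes a DOS-era application would typically write

# The same text is stored as different bytes under the two code pages.
print(ansi_bytes, oem_bytes)

# Decoding with the wrong code page raises no error; it silently
# produces different text, so the mistake is easy to miss.
misread_as_oem = ansi_bytes.decode(OEM)
misread_as_ansi = oem_bytes.decode(ANSI)
print(ascii(misread_as_oem), ascii(misread_as_ansi))
```

Neither wrong decode fails, which is why guessing the charset of a file is
hazardous: the program carries on with corrupted text.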

> Of course we could arbitrarily choose one of these two encodings as the
> "true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
> something about how likely either one is to be correct for a particular
> user's goals.

Right -- it's impossible to make a clear distinction between "files used by
console applications" and "files used by graphical applications", since any
text file can be used by both. This just supports my assertion that there
should not be a "default" encoding at all.
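
As a rough sketch of what "no default encoding" could look like in practice,
the hypothetical helper below refuses to read a file unless the caller names
the encoding. The name read_text, the sample text, and the choice of cp1252 and
cp437 are assumptions made for illustration; this is not a proposed API.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a text file; the encoding argument is required, never guessed."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Write a sample file as cp1252 (assumed here to be the ANSI code page).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="cp1252", newline="") as f:
    f.write("na\u00efve")  # "naive" with a diaeresis on the i

right = read_text(path, "cp1252")  # the caller knows the file's encoding
wrong = read_text(path, "cp437")   # the caller guessed the OEM code page
os.remove(path)

print(ascii(right), ascii(wrong))
```

Because the encoding parameter has no default value, calling read_text(path)
alone fails immediately with a TypeError instead of silently picking one of
the two code pages.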

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



