[Python-3000] locale-aware strings?

Wed Sep 6 02:44:37 CEST 2006

On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> >
> >> Beyond all of that: It just seems wrong to me that I could send someone a
> >> bunch of files and a Python program and their results processing them
> >> would be different from mine, despite the fact that we run the same version of
> >> Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.
>
> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.

And this is exactly why I want the determination of the default
encoding (i.e. the encoding to be used when opening a file when no
explicit encoding is specified by the Python code that does the
opening) to be open-ended, rather than picking some standard default
like UTF-8 and saying (like Paul seems to want to say) "this is it".
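
A minimal sketch of what an open-ended default could look like, written in present-day Python 3 rather than anything that existed when this message was sent. It checks for a BOM the way notepad is described as doing above, and otherwise falls back to whatever the platform reports. The helper names guess_encoding and open_text are hypothetical, not an existing or proposed API, and UTF-32 BOMs are deliberately left out to keep the sketch short.

```python
import codecs
import locale

# BOM -> codec name. "utf-8-sig" and "utf-16" both consume the BOM on decode.
# Assumption: UTF-32 is ignored here; its little-endian BOM begins with the
# UTF-16LE BOM, so a fuller version would need to test for it first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, fallback=None):
    """Return a codec name for `path` (hypothetical helper).

    A BOM, if present, wins. Otherwise use `fallback`, and if that is
    not given, ask the platform for its preferred encoding.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback or locale.getpreferredencoding(False)


def open_text(path, fallback=None):
    """Open `path` for reading as text using the guessed encoding."""
    return open(path, "r", encoding=guess_encoding(path, fallback))
```

The point of the fallback parameter is that the last step stays replaceable: a platform where everything is UTF-8 can pass "utf-8", while a system in the middle of a migration can keep its old charset for files without a BOM.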

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > this area I'd rather build in more flexibility than strictly needed,
> > than too little. If it turns out that on a particular platform all
> > files are in UTF-8, making Python *on that platform* always choose
> > UTF-8 is simple enough.
>
> The problem is not the systems where all files are UTF-8, or all files are
> another known charset. The problem is the platforms where half of the files
> are UTF-8 and half are in some other charset, determined either by type or by
> presence of a UTF-8 BOM. This is a *very* common situation, especially for
> European users.

Right. (And Paul appears to be ignorant of this.)

> Such a user cannot set the locale to UTF-8, because that will break all of
> their non-Unicode-aware applications. The Unicode-aware applications typically
> have much better support for reading and writing files in charsets that are
> not the system default. So in practice the locale has to be set to the "old"
> charset during a migration to UTF-8.
>
> (Setting different locales for different applications is far too much hassle.
> On Windows, although I believe it is technically possible to do the equivalent
> of selecting a UTF-8 locale, most users don't know how to do it, even if they
> want to use UTF-8 exclusively.)

Right. Of course, "locale" and "encoding" are somewhat orthogonal
issues; the encoding may be UTF-8 but that doesn't determine other
aspects of the locale (such as language-specific collation order, or
culture-specific formatting of numbers, dates and money). Now, some
platforms may equate the two somehow, and on those platforms we would
have to inspect the locale to tell the encoding; but other platforms
may specify the encoding separately from the locale...
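
As an illustration of that orthogonality in present-day Python 3, the standard locale module reports the preferred encoding and the formatting conventions through separate calls. This is only a sketch: describe_locale is a hypothetical helper name, and the values it returns depend entirely on how the platform and the user's environment are configured.

```python
import locale


def describe_locale():
    """Report the encoding and a few formatting conventions side by side.

    Hypothetical helper for illustration only.
    """
    try:
        # Adopt the user's environment settings instead of the startup "C" locale.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # carry on with whatever locale is currently active.
        pass
    conv = locale.localeconv()
    return {
        # What to use for text files when nothing else is specified.
        "encoding": locale.getpreferredencoding(False),
        # Culture-specific formatting, independent of the encoding.
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
        "currency_symbol": conv["currency_symbol"],
    }
```

Two users can both see a UTF-8 encoding here and still get different decimal points and currency symbols, which is the sense in which the two issues are separate.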

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)