[Python-3000] locale-aware strings ?

Mon Sep 4 17:50:51 CEST 2006

Guido van Rossum wrote:
> On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> 
>>Two followup questions, then ...
>>
>>(1)  To what extent should python support files (including stdin,
>>stdout) in local (non-unicode) encodings?  (not at all, per-file,
>>settable global default?)

Per-file, I hope.

> I've always said (can someone find a quote perhaps?) that there ought
> to be a sensible default encoding for files (including but not limited
> to stdin/out/err), perhaps influenced by personalized settings,
> environment variables, the OS, etc.

While it should be possible to find out what the OS believes to be
the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
nl_langinfo(CODESET), as selected by the LC_ALL, LC_CTYPE or LANG
environment variables, on Unix), that does not mean that it is this
charset that Python programs should normally use. When defining a new
text-based file type, it is simpler to define it to be always UTF-8.
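
As a rough illustration of the per-file approach, here is a minimal
sketch in present-day Python 3 syntax (an assumption; nothing in this
thread had settled on an API yet). The helper names system_charset,
write_text and read_text are hypothetical. The locale's charset is
queried only for information, while the file itself is always read and
written as UTF-8.

```python
import locale
import os
import tempfile

def system_charset():
    """Return the charset the OS/locale reports as the user's preference."""
    # False: do not call setlocale() as a side effect of asking.
    return locale.getpreferredencoding(False)

def write_text(path, text, encoding="utf-8"):
    # The encoding is chosen per file, not taken from the locale.
    with open(path, "w", encoding=encoding) as f:
        f.write(text)

def read_text(path, encoding="utf-8"):
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# "smorgasbord" spelled with o-umlaut and a-ring, written as escapes so
# the example does not depend on the source file's own encoding.
sample = "sm\u00f6rg\u00e5sbord"
path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_text(path, sample)
roundtrip = read_text(path)
```

Because the encoding is fixed by the file type, the round trip gives the
same result whatever system_charset() happens to return on the machine.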

>>(2)  To what extent will strings have an opaque (or at least
>>on-demand) backing store, so that decoding/encoding could be delayed?
>>(For example, Swedish text could be stored in single-byte characters,
>>and only converted to standard unicode on the rare occasions when it
>>met strings in an incompatible encoding.)
> 
> That seems to be a bit of a leading question. Talin is currently
> championing strings with different fixed-width storage, and others
> have proposed even more flexible "polymorphic strings". You might want
> to learn about the NSString type in Apple's Objective-C.

Operating on encoded constant strings, and decoding each character on the
fly, works fine when the charset is stateless and each character has a 1-1
correspondence with a Unicode character (i.e. code point). In that case
the program can operate on the string essentially as if it were Unicode.
It still works fine for variable-width charsets (including UTF-8 and
UTF-16); that just means that the program has to avoid assuming that a
position in the string is the same thing as a character count.
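
To make the position-versus-count distinction concrete, here is a small
sketch in Python 3 (an assumption, as are the helper names
utf8_char_starts and char_at). It works directly on UTF-8 bytes and
decodes only the one character that is asked for.

```python
def utf8_char_starts(data):
    """Yield the byte offset at which each UTF-8 encoded character begins."""
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; every other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            yield offset

def char_at(data, index):
    """Decode only the index-th character of UTF-8 encoded bytes."""
    starts = list(utf8_char_starts(data))
    begin = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")

# 8 characters: n, a, i-diaeresis, v, e, space, euro sign, 5
encoded = "na\u00efve \u20ac5".encode("utf-8")

byte_length = len(encoded)                    # 11 bytes
char_count = len(list(utf8_char_starts(encoded)))  # 8 characters
seventh = char_at(encoded, 6)                 # the euro sign
byte_six = encoded[6:7]                       # b" ", not the euro sign
```

Character number 6 starts at byte offset 7 here, so indexing the bytes
with a character count picks up the wrong data.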

For charsets like ISCII and ISO 2022, which are stateful and/or have
a different encoding model to Unicode, I don't believe this approach
would work very well. But it is fine to support this for some charsets
and not others.
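
A short sketch of why the stateful case is harder, using Python 3's
"iso2022_jp" codec as a stand-in for the ISO 2022 family (an assumption
chosen for illustration; ISCII is not covered). The same two bytes mean
different things depending on the escape sequence that preceded them, so
a character cannot be decoded from its own bytes alone.

```python
# Two kanji, written as escapes so the example does not depend on the
# source file's encoding.
text = "\u65e5\u672c"
data = text.encode("iso2022_jp")

# The encoded form starts with a 3-byte escape sequence that switches
# the decoder into a two-byte character set, and ends with another
# escape sequence that switches back to ASCII.
escape = data[:3]
payload = data[3:5]          # the two bytes carrying the first kanji

# Decoded with the preceding escape sequence, the payload is a kanji.
in_context = data.decode("iso2022_jp")[0]

# Decoded on its own, the decoder is still in its initial ASCII state,
# so the very same bytes come out as two ASCII characters.
out_of_context = payload.decode("iso2022_jp")
```

Supporting on-the-fly decoding for such a charset would mean tracking
the shift state for every position in the string, which is the extra
cost the paragraph above is pointing at.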

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>