Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)

Guido van Rossum guido@python.org
Mon, 22 May 2000 09:16:08 -0700


> From: "Fredrik Lundh" <effbot@telia.com>
>
> Peter Funk wrote:
> > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> 
> you're missing the point -- now that we've added unicode support to
> Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> platforms implement a wctype interface, it's not widely available, and it's
> not always unicode.

Huh?  We were talking strictly 8-bit strings here.  The locale support
hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6
> comes with unicode-aware and fully portable replacements for the ctype
> functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the
Unicode stuff is overkill.

> the code is already in there...
> 
> > On POSIX systems there are several environment variables used to
> > control the default locale settings for a user's session.  For example,
> > on my SuSE Linux system, currently running in the German locale, the
> > environment variable LC_CTYPE=de_DE is set automatically by the file
> > /etc/profile during login, which automatically causes the C-library
> > function toupper('ä') to return an 'Ä' ---you should see
> > a lower case a-umlaut as argument and an upper case umlaut as return
> > value--- without requiring every application to call 'setlocale' explicitly.
> >
> > So this simply works as intended, without having to add calls
> > to 'setlocale' to every application program that uses these C-library functions.
> 
> note that this leaves us with four string flavours in 1.6:
> 
> - 8-bit binary arrays.  may contain binary goop, or text in some strange
>   encoding.  upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding.  upper, strip, etc work
>   as long as the locale is properly configured.
> 
> - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
>   system encoding is a subset of unicode -- which means US ASCII or
>   ISO Latin 1.

This is a figment of your imagination.  You can use 8-bit text strings
to contain Latin-1, but you have to set your locale to match.
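
A minimal sketch of the Latin-1 case, written for present-day Python 3
rather than the 1.6-era interpreter discussed in this thread (so the
behaviour of the 8-bit type differs, and all variable names here are
illustrative assumptions).  It shows that a Latin-1 byte is left alone by
the byte-level upper(), while decoding it to a Unicode string lets upper()
handle it without any locale setup:

```python
# Sketch for present-day Python 3; variable names are illustrative only.
raw = bytes([0xE4])            # the single byte for 'ä' in Latin-1

# bytes.upper() only maps ASCII letters, so this byte is left unchanged.
raw_upper = raw.upper()

# Latin-1 maps each byte to the Unicode code point with the same number,
# so decoding is lossless and the Unicode-aware upper() can be applied.
text = raw.decode("latin-1")
text_upper = text.upper()      # 'Ä'

# Encoding back yields the Latin-1 byte for 'Ä'.
back = text_upper.encode("latin-1")
```

The round trip through "latin-1" only works because every byte value has a
Latin-1 code point; for other 8-bit encodings the codec name would have to
match the data.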

> - wide unicode text strings.  upper, strip, etc always work.
> 
> is this complexity really worth it?

From a backwards compatibility point of view, yes.  Basically,
programs that don't use Unicode should see no change in semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)