[I18n-sig] Re: Pre-PEP: Proposed Python Character Model

Guido van Rossum guido@digicool.com
Tue, 20 Feb 2001 16:54:25 -0500


> "Martin v. Loewis" wrote:
> > Latin-1 is group 0, plane 0, row 0. Why is it any better than any
> > other plane or row?
> 
> I don't know. You tell me.
> 
> >>> "a"==u"a"==chr(97)
> 1
> 
> It looks like we've already decided that group 0, plane 0, row 0 is
> special. A better question is why is the first half of group 0, plane 0,
> row 0 better than the last half?
> 
> >>> unichr(160)==chr(160)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> 
> The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. It's
> not just an accident. I don't think it makes sense for us to agree with
> them "halfway"...especially when this half-way agreement causes all
> kinds of nasty problems like forcing Python to raise exceptions in
> places that are really surprising like equality tests and sort
> functions.

This has been hashed to death many times before.  We have absolutely
no guarantee that the files from which Python strings are read are
encoded in Latin-1, but we can be fairly sure that they use an ASCII
superset (if they represent characters at all).  Using the locale
module, the user can (implicitly) indicate what the character set is,
and this may not be Latin-1.  Since s.islower() and other similar
functions are locale-sensitive, it would be inconsistent to declare
that 8-bit strings are always encoded in Latin-1.  This locale dependence is historical
baggage that cannot easily be fixed without breaking lots of code
handling character data using legacy encodings (and typically, such
code is not served by a switch to Unicode).  It's possible to change
locales in mid-execution, but for various reasons it's bad to change
the default encoding in mid-execution, so the best we can do is assume
ASCII as the default encoding.
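
To illustrate the point about ASCII supersets, here is a minimal sketch
in present-day Python 3 syntax (an assumption; the thread itself is about
Python 2 8-bit strings).  The byte value 0xE9 and the choice of KOI8-R as
the "other" legacy encoding are picked purely for illustration.  The same
non-ASCII byte decodes to different characters under two legacy
encodings, while an ASCII byte decodes identically under both, which is
why ASCII is the only safe assumption:

```python
# Python 3 sketch; the byte value and encodings are illustrative choices.
raw = bytes([0xE9])  # one non-ASCII byte of unknown origin

# The same byte means different characters in different legacy encodings.
as_latin1 = raw.decode("latin-1")
as_koi8r = raw.decode("koi8_r")
print(as_latin1 == as_koi8r)

# Bytes in the ASCII range agree under every ASCII superset used here.
ascii_raw = b"a"
ascii_agrees = (
    ascii_raw.decode("ascii")
    == ascii_raw.decode("latin-1")
    == ascii_raw.decode("koi8_r")
)
print(ascii_agrees)

# With ASCII as the assumed encoding, the ambiguous byte is rejected
# instead of being silently guessed.
try:
    raw.decode("ascii")
    ascii_accepts_raw = True
except UnicodeDecodeError:
    ascii_accepts_raw = False
print(ascii_accepts_raw)
```

Refusing the byte is the Python 3 analogue of the "ASCII decoding error"
shown in the quoted session above.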

--Guido van Rossum (home page: http://www.python.org/~guido/)