[Python-Dev] Pre-PEP: Python Character Model

Paul Prescod paulp@ActiveState.com
Tue, 06 Feb 2001 07:54:49 -0800


"M.-A. Lemburg" wrote:
> 
> ...
> 
> Oh, I think that everybody agrees on moving to Unicode as
> basic text storage container. 

The last time we went around on this, there was an anti-Unicode faction
that argued that adding Unicode support was fine but making it the default
would inconvenience Japanese users.

> ...
> Well, with -U on, Python will compile "" into u"", so you can
> already test Unicode compatibility today... last I tried, Python
> didn't even start up :-(

I'm going to say again that I don't see that as a test of
Unicode-compatibility. It is a test of compatibility with our existing
Unicode object. If we simply allowed string objects to support higher
character numbers I *cannot see* how that could break existing code.

> ...
> We can use that knowledge to base future design upon. The problem
> with many stdlib modules is that they don't make a difference
> between text and binary data (and often can't, e.g. take sockets),
> so we'll have to figure out a way to differentiate between the
> two. We'll also need an easy-to-use binary data type -- as you
> mention in the PEP, we could take the old string implementation
> as basis and then perhaps turn u"" into "" and use b"" to mean
> what "" does now (string object).

I agree that we need all of this, but I strongly disagree that there is
any dependency relationship between improving the Unicode-awareness of
I/O routines (sockets and files) and allowing string objects to support
higher character numbers. I claim that allowing higher character numbers
in strings will not break socket objects. It might simply be the case
that for a while socket objects never create these higher characters.
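
As a rough sketch of that claim in present-day Python 3 (where str already
holds any character number), an I/O layer that maps each received byte to
the character with the same number can only ever produce characters below
256. The helper names bytes_to_text and recv_text are made up for
illustration, and the Latin-1 mapping is an assumption, not something
sockets do on their own:

```python
import socket


def bytes_to_text(data):
    # Latin-1 maps byte value N to character number N, so the result
    # never contains a character numbered 256 or higher.
    return data.decode("latin-1")


def recv_text(sock, bufsize=4096):
    # Hypothetical helper: read raw bytes from a socket, return them as text.
    return bytes_to_text(sock.recv(bufsize))


text = bytes_to_text(bytes(range(256)))
assert max(map(ord, text)) == 255   # the I/O side creates nothing above 255

# The same string type still accepts higher character numbers from elsewhere.
mixed = text + chr(0x20AC)
assert len(mixed) == 257
```

Code that only ever sees the output of recv_text behaves exactly as it
would with a string type limited to 256 characters.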

Similarly, we could improve socket objects so that they have separate
readtext/readbinary and writetext/writebinary methods without unifying the
string objects. There are lots of small changes we can make without
breaking anything. One I would like to see right now is a unification of
chr() and unichr().
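
To make the chr()/unichr() unification concrete, here is a small sketch
written against present-day Python 3, whose built-in chr() already accepts
every character number. The name unified_chr is invented for this example,
and the range check and error message are assumptions about what a merged
function would do:

```python
import sys


def unified_chr(i):
    # One function for every character number, instead of chr() for
    # 0..255 and unichr() for everything above.
    if not 0 <= i <= sys.maxunicode:
        raise ValueError("character number out of range: %d" % i)
    return chr(i)


assert unified_chr(65) == "A"
assert ord(unified_chr(0x20AC)) == 0x20AC
```

Callers that only pass numbers below 256 see no difference, which is why
this looks like a change that cannot break existing code.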

We are just making life harder for ourselves by walking further and
further down one path when "everyone agrees" that we are eventually
going to end up on another path.

> ... It would be nice if we could avoid
> adding more conversion magic...

We already have more "magic" in our conversions than we need. I don't
think I'm proposing any new conversions.

 Paul Prescod