[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

M.-A. Lemburg mal@lemburg.com
Tue, 06 Feb 2001 18:43:05 +0100

[Moving the follow-ups to i18n-sig...]

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> > ...
> >
> > Oh, I think that everybody agrees on moving to Unicode as
> > basic text storage container.
> The last time we went around there was an anti-Unicode faction who
> argued that adding Unicode support was fine but making it the default
> would inconvenience Japanese users.

Unicode is the de facto international standard for unified
script encodings. Discussing whether Unicode is good or bad is
really beyond the scope of language design and should be dealt
with in other, more suitable forums, IMHO.

> > ...
> > Well, with -U on, Python will compile "" into u"", so you can
> > already test Unicode compatibility today... last I tried, Python
> > didn't even start up :-(
> I'm going to say again that I don't see that as a test of
> Unicode-compatibility. It is a test of compatibility with our existing
> Unicode object. If we simply allowed string objects to support higher
> character numbers I *cannot see* how that could break existing code.

It's a nice way of identifying problem locations in existing
Python code.
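
To illustrate the kind of problem location such a test turns up, here
is a minimal sketch. It is an assumption-laden stand-in: it uses
Python 3, where every "" literal is already Unicode, and the function
name add_newline() is hypothetical. Python 3 is also stricter than the
interpreter discussed here, which attempts implicit conversions, so
the exact failure differs.

```python
# Sketch only: Python 3 string literals are Unicode, which is roughly
# the situation -U is meant to simulate. add_newline() is hypothetical.

def add_newline(data):
    """Legacy-style code that uses a "" literal with any kind of data."""
    return data + "\n"   # fine for text, fails for 8-bit (binary) data

# Text input works as before.
print(repr(add_newline("abc")))

# Binary input exposes the spot where text and bytes get mixed.
try:
    add_newline(b"\x00\x01")
except TypeError as exc:
    print("problem location found:", exc)
```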

I don't understand your statement about allowing string objects
to support "higher" ordinals... are you proposing to add a third
character type?

> > ...
> > We can use that knowledge to base future design upon. The problem
> > with many stdlib modules is that they don't make a difference
> > between text and binary data (and often can't, e.g. take sockets),
> > so we'll have to figure out a way to differentiate between the
> > two. We'll also need an easy-to-use binary data type -- as you
> > mention in the PEP, we could take the old string implementation
> > as basis and then perhaps turn u"" into "" and use b"" to mean
> > what "" does now (string object).
> I agree that we need all of this but I strongly disagree that there is
> any dependency relationship between improving the Unicode-awareness of
> I/O routines (sockets and files) and allowing string objects to support
> higher character numbers. I claim that allowing higher character numbers
> in strings will not break socket objects. It might simply be the case
> that for a while socket objects never create these higher characters.
> Similarly, we could improve socket objects so that they have different
> readtext/readbinary and writetext/writebinary without unifying the
> string objects. There are lots of small changes we can make without
> breaking anything. One I would like to see right now is a unification of
> chr() and unichr().

This won't work: programs simply do not expect to get Unicode
characters out of chr() and would break. OTOH, programs using
unichr() don't expect 8bit-strings as output.
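
A minimal sketch of the breakage, under stated assumptions: Python 3
bytes and str stand in for the 8-bit and Unicode string types, and
old_chr(), unified_chr() and build_packet() are hypothetical names
made up for this example, not existing builtins.

```python
# Sketch only: bytes models the 8-bit string type, str the Unicode type.
# old_chr(), unified_chr() and build_packet() are hypothetical helpers.

def old_chr(i):
    """Emulate the 8-bit chr(): one byte, ordinals 0..255 only."""
    if not 0 <= i <= 255:
        raise ValueError("chr() arg not in range(256)")
    return bytes([i])

def unified_chr(i):
    """Emulate a unified chr()/unichr(): always a Unicode character."""
    return chr(i)

def build_packet(payload_ordinals, make_char):
    """Legacy-style code that assumes make_char() yields 8-bit data."""
    header = b"\x02"
    body = b"".join(make_char(i) for i in payload_ordinals)
    return header + body

# With the 8-bit chr() the existing code keeps working.
print(build_packet([72, 105], old_chr))

# With a unified chr() the same code fails when joining the pieces.
try:
    build_packet([72, 105], unified_chr)
except TypeError as exc:
    print("breaks with unified chr():", exc)
```

The caller did nothing wrong by its own assumptions; only the return
type of the character constructor changed underneath it.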

Let's keep the two worlds well separated for a while and
unify afterwards (this is much easier to do when everything's
in place and well tested).

> We are just making life harder for ourselves by walking further and
> further down one path when "everyone agrees" that we are eventually
> going to end up on another path.

No. We are just sending off a pioneer team to try to find an
alternative path. Once that path is found we can switch signs
to have the mainstream use the new alternative path.

> > ... It would be nice if we could avoid
> > adding more conversion magic...
> We already have more "magic" in our conversions than we need. I don't
> think I'm proposing any new conversions.

Well, let's hope so :-)

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/