[I18n-sig] Pre-PEP: Python Character Model

M.-A. Lemburg mal@lemburg.com
Wed, 07 Feb 2001 12:47:53 +0100

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >...
> >
> > I don't understand your statement about allowing string objects
> > to support "higher" ordinals... are you proposing to add a third
> > character type ?
> Yes and no. I want to make a type with a superset of the functionality
> of strings and Unicode strings.

Hmm, and I was under the impression that we were trying to replace
strings with Unicode and then perhaps reuse the 8-bit string
implementation for binary data.

> > > Similarly, we could improve socket objects so that they have different
> > > readtext/readbinary and writetext/writebinary without unifying the
> > > string objects. There are lots of small changes we can make without
> > > breaking anything.
> Before we go on: do you agree that we could add fopen and
> readtext/readbinary on various I/O types without breaking anything? And
> that we should do so?

Sure. We can always add new things, then deprecate the old stuff
and slowly move to the new methods as standard. E.g. adding
.readtext() and .writetext() would be a good start in that
direction since those names make it clear that the code will
deal with text rather than binary data.
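
A minimal sketch of what such methods might look like, written in
present-day Python 3 syntax. The class name TextBinaryFile and the
encoding parameter are assumptions made for illustration only; nothing
in this thread specifies them.

```python
import io


class TextBinaryFile:
    """Hypothetical wrapper with separate text and binary methods."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # any binary file-like object
        self._encoding = encoding  # assumed default, for illustration

    def readbinary(self, size=-1):
        # Return the raw bytes untouched.
        return self._raw.read(size)

    def writebinary(self, data):
        # Pass bytes straight through to the underlying stream.
        return self._raw.write(data)

    def readtext(self):
        # Decode everything remaining in the stream into a text string.
        return self._raw.read().decode(self._encoding)

    def writetext(self, text):
        # Encode text before it reaches the underlying byte stream.
        return self._raw.write(text.encode(self._encoding))


buf = io.BytesIO()
f = TextBinaryFile(buf)
f.writetext("caf\u00e9")   # four characters of text
buf.seek(0)
raw = f.readbinary()       # the same data seen as bytes
buf.seek(0)
text = f.readtext()        # and decoded back into text
```

The method names alone tell the reader whether a call site deals in
characters or in bytes, which is the point of the naming.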

> > > One I would like to see right now is a unification of
> > > chr() and unichr().
> >
> > This won't work: programs simply do not expect to get Unicode
> > characters out of chr() and would break.
> Why would a program pass a large integer to chr() if it cannot handle
> the resulting wide string????

As the result of an error. OK, some other part of the program will
then probably break, but this hides the original error location.

> > OTOH, programs using
> > unichr() don't expect 8bit-strings as output.
> Where would an 8bit string break code that expected a Unicode string?
> The upward conversion is automatic and lossless!

But why would you want to do upward conversion on single characters?
That would only cost performance.

> Having chr() and unichr() is like having a special function for adding
> integers versus longs. IMO it is madness.

No. chr() is a constructor for a single 8-bit character, unichr()
is the corresponding constructor for a single Unicode character.
This is much like the difference between int() and long().
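
A rough sketch of the two-constructor idea, again in present-day
Python 3 syntax. The name chr8 is hypothetical, and unichr is defined
here only as a stand-in, since Python 3 has no builtin of that name.
The strict range check shows how a bad ordinal fails at the call that
produced it rather than somewhere downstream.

```python
def chr8(i):
    """Hypothetical constructor for a single 8-bit character."""
    if not 0 <= i <= 0xFF:
        # Fail here, at the original error location.
        raise ValueError("chr8() arg not in range(256): %r" % (i,))
    return bytes([i])


def unichr(i):
    """Stand-in constructor for a single Unicode character."""
    return chr(i)  # Python 3's chr() accepts the full Unicode range


a = chr8(65)           # one 8-bit character
e = unichr(0x20AC)     # one Unicode character (the euro sign)

try:
    chr8(0x20AC)       # an ordinal that does not fit in 8 bits
except ValueError as exc:
    error = str(exc)
```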

> > Let's keep the two worlds well separated for a while and
> > unify afterwards (this is much easier to do when everything's
> > in place and well tested).
> No, the more we keep the worlds separated the more code will be written
> that expects to deal with two separate types. We need to get people
> thinking in terms of strings of characters not strings of bytes and we
> need to do it as soon as possible.

Ok, then let me put it this way: let's first make people aware
that there is an important difference between text data and
binary data. Once this is accepted, we can move on to
thinking about making Unicode the standard for text data.
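
A small illustration of that difference, assuming present-day
Python 3 semantics (str for text, bytes for binary data). The sample
word and the two encodings are chosen arbitrarily for the example.

```python
text = "Gr\u00fc\u00dfe"              # text: a sequence of characters

as_utf8 = text.encode("utf-8")        # binary: one possible byte form
as_latin1 = text.encode("latin-1")    # binary: a different byte form

# The same text has different lengths and contents as bytes,
# depending on the encoding used to serialize it.
n_chars = len(text)
n_utf8 = len(as_utf8)
n_latin1 = len(as_latin1)

# Decoding with the wrong encoding does not raise here,
# it silently yields different text.
wrong = as_utf8.decode("latin-1")
```

The byte strings only become text again once an encoding is named,
which is why the two kinds of data need to be kept apart.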

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/