[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

M.-A. Lemburg mal@lemburg.com
Tue, 06 Feb 2001 16:09:46 +0100

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> > [pre-PEP]
> >
> > You have a lot of good points in there (also some inaccuracies) and
> > I agree that Python should move to using Unicode for text data
> > and arrays for binary data.
>
> That's my primary goal. If we can all agree that is the goal then we can
> start to design new features with that in mind. I'm overjoyed to have you
> on board. I'm pretty sure Fredrik agrees with the goals (probably not
> every implementation detail). I'll send to i18n sig and see if I can get
> buy-in from Andy Robinson et al. Then it's just Guido.

Oh, I think that everybody agrees on moving to Unicode as the
basic text storage container. The question is how to get there ;-)

Today we are facing a problem in that strings are also used as
containers for binary data and no distinction is made between
the two. We also have to watch out for external interfaces which
still use 8-bit character data, so there's a lot ahead.

> > One thing you may be missing, though, is that Python already
> > has support for a few features you mention, e.g. codecs.open()
> > provides more or less what you have in mind with fopen() and
> > the compiler can already unify Unicode and string literals using
> > the -U command line option.
>
> The problem with unifying string literals without unifying string
> *types* is that many functions probably check for type("") and not
> type(u"").

Well, with -U on, Python will compile "" into u"", so you can
already test Unicode compatibility today... last I tried, Python
didn't even start up :-(
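
For reference, here is a minimal sketch of the codecs.open() interface
mentioned in the quoted text, written for present-day Python. The helper
name roundtrip() and the sample string are illustrative assumptions, not
part of the PEP or this thread.

```python
import codecs
import os
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text through codecs.open() and read it back as text."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The codec encodes the text to bytes on write ...
        with codecs.open(path, "w", encoding=encoding) as f:
            f.write(text)
        # ... and decodes bytes back to text on read.
        with codecs.open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


result = roundtrip("Gr\u00fc\u00dfe")
```

The caller only ever handles text; the encoding is stated once, when the
file is opened.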

> > What you don't talk about in the PEP is that Python's stdlib isn't
> > even Unicode aware yet, and whatever unification steps we take,
> > this project will have to precede it.
>
> I'm not convinced that is true. We should be able to figure it out
> quickly though.

We can use that knowledge as a basis for future design. The problem
with many stdlib modules is that they don't distinguish between
text and binary data (and often can't, e.g. take sockets),
so we'll have to figure out a way to differentiate between the
two. We'll also need an easy-to-use binary data type -- as you
mention in the PEP, we could take the old string implementation
as a basis and then perhaps turn u"" into "" and use b"" to mean
what "" does now (string object).
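
A minimal sketch of the split suggested above, written with the literal
syntax that later Python versions (3.x) use: "" for text and b"" for
binary data. The variable names and the choice of UTF-8 are illustrative
assumptions.

```python
text = "caf\u00e9"        # text: a sequence of four characters
data = b"caf\xc3\xa9"     # binary: the same word as five UTF-8 bytes

# The two literals produce distinct types ...
distinct_types = type(text) is not type(data)

# ... and converting between them requires naming an encoding.
encoded = text.encode("utf-8")
decoded = data.decode("utf-8")

# Mixing the two without a conversion is rejected.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The point of the separate b"" literal is that the text/binary decision is
made where the data is created rather than guessed later by each function.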

> > The problem with making the
> > stdlib Unicode aware is that of deciding which parts deal with
> > text data or binary data -- the code sometimes makes assumptions
> > about the nature of the data and at other times it simply doesn't
> > care.
>
> Can you give an example? If the new string type is 100% backwards
> compatible in every way with the old string type then the only code that
> should break is silly code that did stuff like:
> try:
>     something = chr( somethingelse )
> except ValueError:
>     print "Unicode is evil!"
> Note that I expect types.StringType == type(chr(10000)) etc.

Sure, but there are interfaces which don't differentiate between
text and binary data, e.g. many I/O operations don't care about
what exactly they are writing or reading.

We'd probably define a new set of text data APIs (meaning
methods) to make this difference clear and visible, e.g.
.writetext() and .readtext().
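
One way such methods could look, as a rough sketch in present-day Python.
Only the method names .writetext() and .readtext() come from this message;
the wrapper class TextAwareStream, its encoding parameter, and the use of
an in-memory stream are hypothetical.

```python
import io


class TextAwareStream:
    """Hypothetical wrapper: binary read/write plus explicit text methods."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # any binary file-like object
        self.encoding = encoding  # used only by the text methods

    def write(self, data):
        # Binary path: bytes go through untouched.
        return self.raw.write(data)

    def read(self, size=-1):
        return self.raw.read(size)

    def writetext(self, text):
        # Text path: encode explicitly before writing.
        return self.raw.write(text.encode(self.encoding))

    def readtext(self):
        # Reads everything that remains, so a multi-byte character
        # is never split across two calls.
        return self.raw.read().decode(self.encoding)


stream = TextAwareStream(io.BytesIO())
stream.writetext("na\u00efve")
stream.raw.seek(0)
raw_bytes = stream.read()      # the bytes as stored
stream.raw.seek(0)
decoded = stream.readtext()    # the same content as text
```

Keeping the binary and text methods side by side makes the caller state
which kind of data is meant, without changing the existing read()/write().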

> > In this light I think you ought to focus your PEP on Python 3k.
> > This will also enable better merging techniques due to the
> > lifting of the type/class difference.
>
> Python3K is a beautiful dream but we have problems we need to solve
> today. We could start moving to a Unicode future in baby steps right
> now. Your "open" function could be moved into builtins as "fopen".
> Python's "binary" open function could be deprecated under its current
> name and perhaps renamed.

Hmm, I'd prefer to keep things separate for a while and then
switch over to new APIs once we get used to them.

> The sooner we start the sooner we finish. You and /F laid some beautiful
> groundwork. Now we just need to keep up the momentum. I think we can do
> this without a big backwards compatibility earthquake. VB and TCL
> figured out how to do it...

... and we should probably try to learn from them. They have
put a considerable amount of work into getting the low-level
interfacing issues straight. It would be nice if we could avoid
adding more conversion magic...

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/