Re: [Python-Dev] Pre-PEP: Python Character Model

6 Feb 2001

Paul Prescod wrote:
...
"M.-A. Lemburg" wrote:
...
[pre-PEP]
You have a lot of good points in there (also some inaccuracies) and
I agree that Python should move to using Unicode for text data
and arrays for binary data.
That's my primary goal. If we can all agree that is the goal then we can
start to design new features with that in mind. I'm overjoyed to have you
on board. I'm pretty sure Fredrik agrees with the goals (probably not
every implementation detail). I'll send to i18n sig and see if I can get
buy-in from Andy Robinson et al. Then it's just Guido.
Oh, I think that everybody agrees on moving to Unicode as the
basic text storage container. The question is how to get there ;-)

Today we are facing a problem in that strings are also used as
containers for binary data and no distinction is made between
the two. We also have to watch out for external interfaces which
still use 8-bit character data, so there's a lot ahead.
...
...
One thing you may be missing, though, is that Python already
has support for a few features you mention, e.g. codecs.open()
provides more or less what you have in mind with fopen(), and
the compiler can already unify Unicode and string literals using
the -U command line option.
The problem with unifying string literals without unifying string
*types* is that many functions probably check for type("") and not
type(u"").
Well, with -U on, Python will compile "" into u"", so you can
already test Unicode compatibility today... last I tried, Python
didn't even start up :-(
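For readers unfamiliar with the codecs.open() behavior mentioned earlier in this exchange, here is a minimal sketch: the file is opened with a named encoding, so the program writes and reads text while the bytes on disk are encoded. The file name and sample string are made up for illustration, and the snippet assumes a current Python 3 interpreter rather than the 2.x discussed in this thread.

```python
import codecs
import os
import tempfile

# Hypothetical sample text and file name, for illustration only.
sample = "Gr\u00fc\u00dfe"  # "Grüße"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# codecs.open() wraps the file so that text is encoded on write ...
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# ... and decoded on read.
with codecs.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

# Reading the same file in binary mode shows the raw encoded bytes.
with open(path, "rb") as f:
    raw = f.read()
```

The text round-trips unchanged, while the binary read returns the UTF-8 encoded form, which is longer than the text because two of its characters need two bytes each.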
...
...
What you don't talk about in the PEP is that Python's stdlib isn't
even Unicode aware yet, and whatever unification steps we take,
this project will have to precede it.
I'm not convinced that is true. We should be able to figure it out
quickly though.
We can use that knowledge to base future design upon. The problem
with many stdlib modules is that they don't distinguish
between text and binary data (and often can't, e.g. take sockets),
so we'll have to figure out a way to differentiate between the
two. We'll also need an easy-to-use binary data type -- as you
mention in the PEP, we could take the old string implementation
as a basis and then perhaps turn u"" into "" and use b"" to mean
what "" does now (string object).
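To make the proposed split concrete, here is a small sketch of a text type and a separate binary type side by side. It is written against a later Python in which "" is Unicode text and b"" is a byte string, which is an assumption relative to this thread; the variable names are illustrative only.

```python
# Text: a sequence of Unicode characters.
text = "na\u00efve"          # "naïve", 5 characters

# Binary: a sequence of bytes, here the UTF-8 encoding of the text.
data = text.encode("utf-8")  # 6 bytes, since "ï" takes two bytes in UTF-8

# The two types are distinct and are converted explicitly via a codec.
roundtrip = data.decode("utf-8")

# Indexing shows the difference: a character versus an integer byte value.
first_char = text[0]
first_byte = data[0]

# A binary literal needs no codec at all.
header = b"\x89PNG"
```

The point of the sketch is that conversion between the two happens only through an explicit encode or decode step, so code has to say which kind of data it is handling.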
...
...
The problem with making the
stdlib Unicode aware is that of deciding which parts deal with
text data or binary data -- the code sometimes makes assumptions
about the nature of the data and at other times it simply doesn't
care.
Can you give an example? If the new string type is 100% backwards
compatible in every way with the old string type then the only code that
should break is silly code that did stuff like:
try:
    something = chr( somethingelse )
except ValueError:
    print "Unicode is evil!"
Note that I expect types.StringType == type(chr(10000)) etc.
Sure, but there are interfaces which don't differentiate between
text and binary data, e.g. many IO-operations don't care about
what exactly they are writing or reading.

We'd probably define a new set of text data APIs (meaning
methods) to make this difference clear and visible, e.g.
.writetext() and .readtext().
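A rough sketch of what such an API split might look like follows. The class name TextAwareStream, the default encoding, and the method signatures are hypothetical; only the method names .writetext() and .readtext() come from the message above. It assumes a current Python 3 interpreter.

```python
import io


class TextAwareStream:
    """Hypothetical wrapper that keeps text and binary I/O visibly separate."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # underlying binary stream
        self.encoding = encoding  # codec used by the text methods

    def write(self, data):
        # Binary API: accepts bytes only.
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("write() expects binary data; use writetext()")
        return self.raw.write(data)

    def read(self, size=-1):
        # Binary API: returns bytes unchanged.
        return self.raw.read(size)

    def writetext(self, text):
        # Text API: encode explicitly before handing bytes to the stream.
        if not isinstance(text, str):
            raise TypeError("writetext() expects text; use write()")
        return self.raw.write(text.encode(self.encoding))

    def readtext(self):
        # Text API: read the remaining bytes and decode them.
        return self.raw.read().decode(self.encoding)


stream = TextAwareStream(io.BytesIO())
stream.writetext("caf\u00e9")   # text goes through the codec
stream.write(b"\x00\x01")       # binary data is written untouched
stream.raw.seek(0)
text_part = stream.raw.read(5).decode("utf-8")  # "café" is 5 bytes in UTF-8
binary_part = stream.read()
```

Keeping the plain read()/write() pair binary-only and putting the codec behind separately named methods is one way to make the distinction visible at the call site; other designs are possible.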
...
...
In this light I think you ought to focus your PEP on Python 3k.
This will also enable better merging techniques due to the
lifting of the type/class difference.
Python3K is a beautiful dream but we have problems we need to solve
today. We could start moving to a Unicode future in baby steps right
now. Your "open" function could be moved into builtins as "fopen".
Python's "binary" open function could be deprecated under its current
name and perhaps renamed.
Hmm, I'd prefer to keep things separate for a while and then
switch over to new APIs once we get used to them.
...
The sooner we start the sooner we finish. You and /F laid some beautiful
groundwork. Now we just need to keep up the momentum. I think we can do
this without a big backwards compatibility earthquake. VB and TCL
figured out how to do it...
... and we should probably try to learn from them. They have
put a considerable amount of work into getting the low-level
interfacing issues straight. It would be nice if we could avoid
adding more conversion magic...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/