Paul Prescod wrote:
"M.-A. Lemburg" wrote:
[pre-PEP]
You have a lot of good points in there (also some inaccuracies) and I agree that Python should move to using Unicode for text data and arrays for binary data.
That's my primary goal. If we can all agree that is the goal then we can start to design new features with that in mind. I'm overjoyed to have you on board. I'm pretty sure Fredrick agrees with the goals (probably not every implementation detail). I'll send to i18n sig and see if I can get buy-in from Andy Robinson et al. Then it's just Guido.
Oh, I think that everybody agrees on moving to Unicode as the basic text storage container. The question is how to get there ;-) Today we are facing a problem in that strings are also used as containers for binary data and no distinction is made between the two. We also have to watch out for external interfaces which still use 8-bit character data, so there's a lot ahead.
One thing you may be missing, though, is that Python already has support for a few of the features you mention, e.g. codecs.open() provides more or less what you have in mind with fopen(), and the compiler can already unify Unicode and string literals using the -U command line option.
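As a rough illustration of the codecs.open() point above, here is a minimal sketch of an encoding-aware file round trip. The sample text, the temporary path and the choice of Latin-1 are assumptions made for the example and do not come from the thread.

```python
import codecs
import os
import tempfile

# Hypothetical sample text; any string with non-ASCII characters would do.
text = u"Gr\u00fc\u00dfe aus D\u00fcsseldorf"

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write: the wrapper encodes Unicode text to Latin-1 bytes on the way out.
with codecs.open(path, "w", encoding="latin-1") as f:
    f.write(text)

# Read: the wrapper decodes the bytes back into Unicode text.
with codecs.open(path, "r", encoding="latin-1") as f:
    roundtrip = f.read()

# Reading the same file in binary mode shows the encoded bytes;
# Latin-1 uses exactly one byte per character here.
with open(path, "rb") as f:
    raw = f.read()
```

The caller only ever handles Unicode text; the encoding is named once, when the file is opened.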
The problem with unifying string literals without unifying string *types* is that many functions probably check for type("") and not type(u"").
Well, with -U on, Python will compile "" into u"", so you can already test Unicode compatibility today... last I tried, Python didn't even start up :-(
What you don't talk about in the PEP is that Python's stdlib isn't even Unicode aware yet, and whatever unification steps we take, making it Unicode aware will have to precede them.
I'm not convinced that is true. We should be able to figure it out quickly though.
We can use that knowledge to base future design upon. The problem with many stdlib modules is that they don't distinguish between text and binary data (and often can't, e.g. take sockets), so we'll have to figure out a way to differentiate between the two. We'll also need an easy-to-use binary data type -- as you mention in the PEP, we could take the old string implementation as a basis and then perhaps turn u"" into "" and use b"" to mean what "" does now (string object).
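To make the u""/b"" idea concrete, here is a small sketch. It assumes an interpreter that accepts both prefixes and treats b"" as a separate byte-oriented type, which is what the thread proposes rather than what it describes as existing; the sample values are made up.

```python
# Text: a sequence of Unicode characters.
text = u"caf\u00e9"

# Binary: a sequence of 8-bit values, written with the proposed b"" prefix.
magic = b"\x89PNG"

# Crossing the boundary requires naming an encoding explicitly.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

text_len = len(text)        # counts characters
encoded_len = len(encoded)  # counts bytes
```

The two lengths differ because the accented character takes two bytes in UTF-8, which is exactly the sort of detail code can no longer ignore once text and binary data are separate types.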
The problem with making the stdlib Unicode aware is that of deciding which parts deal with text data or binary data -- the code sometimes makes assumptions about the nature of the data and at other times it simply doesn't care.
Can you give an example? If the new string type is 100% backwards compatible in every way with the old string type then the only code that should break is silly code that did stuff like:
    try:
        something = chr( somethingelse )
    except ValueError:
        print "Unicode is evil!"

Note that I expect types.StringType == type(chr(10000)) etc.
Sure, but there are interfaces which don't differentiate between text and binary data, e.g. many IO-operations don't care about what exactly they are writing or reading. We'd probably define a new set of text data APIs (meaning methods) to make this difference clear and visible, e.g. .writetext() and .readtext().
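One way such an interface could look is sketched below. Only the method names .writetext() and .readtext() come from the message above; the wrapper class, its constructor arguments and the UTF-8 default are hypothetical.

```python
import io


class TextBinaryStream:
    """Hypothetical wrapper that keeps text and binary I/O visibly apart."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # underlying byte-oriented stream
        self.encoding = encoding  # assumed default, not from the thread

    def write(self, data):
        # Binary API: bytes pass through untouched.
        self.raw.write(data)

    def read(self, size=-1):
        # Binary API: returns raw bytes.
        return self.raw.read(size)

    def writetext(self, text):
        # Text API: encode Unicode text before it reaches the stream.
        self.raw.write(text.encode(self.encoding))

    def readtext(self):
        # Text API: decode everything that remains in the stream.
        return self.raw.read().decode(self.encoding)


buf = io.BytesIO()
stream = TextBinaryStream(buf)
stream.write(b"\x00\x01")       # binary header
stream.writetext(u"\u00fcber")  # text payload
buf.seek(0)
header = stream.read(2)
body = stream.readtext()
```

The sketch reads all remaining bytes in readtext() to sidestep the question of a read splitting a multi-byte character; a real design would need an incremental decoder.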
In this light I think you ought to focus your PEP on Python 3k. This will also enable better merging techniques due to the lifting of the type/class difference.
Python3K is a beautiful dream but we have problems we need to solve today. We could start moving to a Unicode future in baby steps right now. Your "open" function could be moved into builtins as "fopen". Python's "binary" open function could be deprecated under its current name and perhaps renamed.
Hmm, I'd prefer to keep things separate for a while and then switch over to new APIs once we get used to them.
The sooner we start the sooner we finish. You and /F laid some beautiful groundwork. Now we just need to keep up the momentum. I think we can do this without a big backwards compatibility earthquake. VB and TCL figured out how to do it...
... and we should probably try to learn from them. They have put a considerable amount of work into getting the low-level interfacing issues straight. It would be nice if we could avoid adding more conversion magic...

--
Marc-Andre Lemburg
______________________________________________________________________
Company:        http://www.egenix.com/
Consulting:     http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/