[I18n-sig] Re: Unicode debate

M.-A. Lemburg mal@lemburg.com
Sat, 29 Apr 2000 15:25:47 +0200

Just van Rossum wrote:
> At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >Right. Binary data in such a string literal would have to
> >use str('...data...','binary') to get the correct encoding
> >attached to it.
> And that sucks.

Not sure why... after all the point of adding encoding information
to strings was to add missing information: the current usage
as binary data container would then be justified provided the
strings are marked as containing binary data.

> I stick to my point that the encoding attr should *not* be
> used when dealing strictly with bit strings. Ever. At all. Its *only*
> purpose is to aid "upcasting" to unicode. (But maybe that purpose is too
> weak to warrant an entirely new attribute...)

I think the little experiment with adding an encoding attribute
to strings is not going to be the right solution. People will
get all confused, the implementation won't be able to make much
use of it without proper forwarding of the information, and that
forwarding costs performance even for those programs which do
not need this at all.

Guido's suggestion is more practical: either go all the way
(meaning to write all *text* as Unicode objects) or don't
use Unicode at all.

Note that the patch I sent to the patches list enables you
to test the "go all the way" strategy in an even more radical
way: it converts all "..." strings to u"..." when the -U
command line option is given.

I think we should use the experience gained with that patch
to make the standard Python library (and the interpreter)
Unicode capable.

Here's a list of what I've found by running some of the
regression tests:

* import string fails due to the way _idtable is constructed
* getattr() doesn't like Unicode as second argument, same for
  delattr() and hasattr()
* eval() expects a string object
* there still are some string exceptions around in the regression
  tests which cause a failure (Unicode exceptions don't work)
* struct.pack('s') doesn't like Unicode as argument
* re doesn't work: pcre_expand() needs a string object
* regex doesn't work either because string objects are hard-coded
* mmap doesn't like Unicode: "mmap assignment must be
  single-character string"
* cPickle.loads() doesn't like Unicode as data storage
* keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work)
* rotor doesn't work
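
A rough sketch of how such checks can be probed one by one. This is
not the patch or the regression suite: the helper probe(), the Dummy
class and the function f() are made up for illustration, and the
snippet assumes a present-day Python 3 interpreter (where every "..."
literal is already Unicode text), so the outcomes will differ from
the list above.

```python
import pickle
import struct


def probe(label, func):
    """Call func() and record whether it accepted its text argument."""
    try:
        func()
    except Exception as exc:
        return (label, False, type(exc).__name__)
    return (label, True, None)


class Dummy:
    # Hypothetical class, only here to have an attribute to look up.
    attr = 1


def f(*args, **kwargs):
    # Hypothetical function used to test keyword expansion.
    return kwargs


# Each check passes a text (Unicode) string where the 2000-era
# interpreter expected an 8-bit string object.
checks = [
    ("getattr", lambda: getattr(Dummy, "attr")),
    ("hasattr", lambda: hasattr(Dummy, "attr")),
    ("eval", lambda: eval("1 + 2")),
    ("keywords", lambda: f(1, 2, 3, **{"a": 4, "b": 5})),
    ("struct.pack", lambda: struct.pack("3s", "abc")),
    ("pickle.loads", lambda: pickle.loads("not bytes")),
]

results = [probe(label, func) for label, func in checks]

for label, ok, error in results:
    print(label, "ok" if ok else "fails with " + error)
```

The name-based APIs (getattr, eval, keywords) take text, while the
raw-data APIs (struct, pickle) reject it; those are the ones that
belong on the binary side of the split.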

Some of these could be fixed by putting a str() call around
the '...' constants. Others need fixes in C code. Yet others
would be better off if they used the buffer interfaces (basically
all APIs which work on raw data like cPickle or rotor).
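
To illustrate the buffer-interface idea, here is a minimal sketch of
a raw-data API that accepts any object exposing a buffer instead of
insisting on a string object. It assumes a present-day Python 3,
where the buffer interface is reached through memoryview; the
function checksum() is hypothetical and is not part of cPickle,
rotor or any other module named above.

```python
from array import array


def checksum(data):
    """Sum all bytes of any buffer-providing object, modulo 256.

    memoryview() raises TypeError for objects without a buffer,
    which in Python 3 includes text (Unicode) strings.
    """
    with memoryview(data) as view:
        # View the underlying memory as plain unsigned bytes,
        # whatever the item type of the original object was.
        return sum(view.cast("B")) & 0xFF


# The same function works for several raw-data containers.
print(checksum(b"abc"))
print(checksum(bytearray(b"abc")))
print(checksum(array("B", [1, 2, 3])))

# Text is rejected instead of being silently reinterpreted.
try:
    checksum("abc")
except TypeError as exc:
    print("text rejected:", exc)
```

An API written this way never has to decide what a text string
"means" as bytes; the caller must encode it explicitly first.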

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/