
(Thanks for all the comments. I'll condense my replies into one post.)

[JvR]
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
[Tom Emerson wrote]
I disagree with you here... store them as UTF-8.
Erm, utf-8 in a wide string? This makes no sense...

[Skip Montanaro]
Presumably, with Just's proposal len() would simply return ob_size/width.
Right. And if you would allow values for width other than 1 and 2, it opens the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and "only" width==1 needs to be special-cased for speed.
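To make the width idea concrete, here is a minimal Python sketch of a fixed-width string, assuming a raw byte buffer plus a per-character width of 1, 2 or 4. The class name `FixedWidthString`, the `from_text` helper and the little-endian byte order are made up for illustration; nothing here is the actual C implementation.

```python
class FixedWidthString:
    """Hypothetical model: raw bytes plus a fixed number of bytes per character."""

    def __init__(self, data: bytes, width: int = 1):
        if width not in (1, 2, 4):
            raise ValueError("width must be 1, 2 or 4")
        if len(data) % width:
            raise ValueError("byte length must be a multiple of width")
        self.data = data    # plays the role of the raw buffer (ob_size bytes)
        self.width = width  # bytes per character

    @classmethod
    def from_text(cls, text: str) -> "FixedWidthString":
        # Pick the narrowest width that can hold every code point (assumption).
        top = max(map(ord, text), default=0)
        width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
        data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
        return cls(data, width)

    def __len__(self) -> int:
        # len() is simply ob_size / width, as discussed above.
        return len(self.data) // self.width

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.width
        return chr(int.from_bytes(self.data[start:start + self.width], "little"))
```

Because every character occupies the same number of bytes, both `len()` and indexing stay constant-time, which is exactly what a variable width encoding would lose.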
If you used a variable width encoding, Just's plan wouldn't work.
Correct, but nor does the current unicode object. Variable width encodings are too messy to see as strings at all: they are only useful as byte arrays.

[GvR]
This seems to have some nice properties, but I think it would cause problems for existing C code that tries to *interpret* the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes).
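A small Python illustration of that concern, assuming two-byte little-endian storage for the wide case: `c_strlen` is a made-up stand-in for C code that walks a buffer until the first NUL byte without checking any "wide" flag.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen(): count bytes up to the first NUL byte."""
    end = buf.find(b"\x00")
    return len(buf) if end < 0 else end


narrow = "spam.txt".encode("latin-1")    # one byte per character
wide = "spam.txt".encode("utf-16-le")    # two bytes per character, no BOM

narrow_len = c_strlen(narrow)  # 8: the whole name is seen
wide_len = c_strlen(wide)      # 1: stops at the high byte of "s"
```

Old code of this kind would silently see a one-character string rather than raise an error, which is the "wrong thing" being described.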
There are probably many cases that fall into this category. But then again, these cases, especially those that potentially can deal with other encodings than ascii, are not much helped by a default encoding, as /F showed.

My idea arose after yesterday's discussions. Some quotes, plus comments:

[GvR]
However the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.)
Guido goes on to explain that this means utf-8 is the only sensible default in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.

[MAL]
BTW, __str__() has to return strings too. Perhaps we need __unicode__() and a corresponding slot function too ?!
This also seems backwards. If it's really too hard to change Python so that __str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term:
print x
does, conceptually:
    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
        sys.stdout.write("\n")
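For anyone who wants to try the idea, here is a runnable sketch in present-day Python. `print_object` and `UpperStream` are hypothetical names, and it assumes a stream's `printout()` is responsible for writing its own newline.

```python
import io
import sys


def print_object(x, stream=None):
    """Sketch of the conceptual print: prefer stream.printout(x) if it exists."""
    stream = sys.stdout if stream is None else stream
    try:
        printout = stream.printout
    except AttributeError:
        # Plain stream: fall back to the str()-based behaviour.
        stream.write(str(x))
        stream.write("\n")
    else:
        # Aware stream: hand over the original object, unconverted.
        printout(x)


class UpperStream(io.StringIO):
    """Toy stream that implements printout() and formats the object itself."""

    def printout(self, x):
        self.write(str(x).upper() + "\n")
```

Unlike the conceptual version, this looks the method up before calling it, so an AttributeError raised inside `printout()` itself is not mistaken for a missing method.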
That stuff like this is even being *proposed* (not that it's not smart or anything...) means there's a terrible bottleneck somewhere which needs fixing. My proposal seems to do that nicely.

Of course, there's no such thing as a free lunch, and I'm sure there are other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc)) ...

in all places that may accept unicode strings is no fun either.

Yes, some code will break if you throw a wide string at it, but I think that code is more easily repaired with my proposal than with the current implementation.

It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about.

Just