
Today I had a relatively simple idea that unites wide strings and narrow strings in a way that is more backward compatible at the C level. It's quite possible this has already been considered and rejected for reasons that are not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2; all other string operations will have to do similar things.
- there can possibly be an encoding attribute which may specify the used encoding, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have a narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows __str__() & str() and friends to work with unicode strings without any change.

Any thoughts?

Just
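To make the proposal concrete, here is a minimal Python sketch of the narrow/wide model Just describes: raw bytes plus a width flag, with the logical length derived as ob_size divided by the width. The class and attribute names are hypothetical illustrations, not real CPython internals.

```python
# Hypothetical sketch of the narrow/wide string model (not real CPython code).
class ExtString:
    def __init__(self, data, width=1):
        assert width in (1, 2)          # narrow or wide
        assert len(data) % width == 0   # physical size must be a multiple of width
        self.data = bytes(data)         # raw bytes, stored as a narrow string would be
        self.width = width              # the "wide" flag, generalized to a width
        self.ob_size = len(self.data)   # _physical_ length in bytes

    def __len__(self):
        # logical length: ob_size / width, as in the proposal
        return self.ob_size // self.width

# A wide string storing two bytes per Unicode character (UCS-2, big-endian here):
s = ExtString("abc".encode("utf-16-be"), width=2)
assert s.ob_size == 6 and len(s) == 3
```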

Just van Rossum writes:
The main concept is not to provide a new string type but to extend the existing string object like so:
This is the most logical thing to do.
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
I disagree with you here... store them as UTF-8.
- there's a flag that specifies whether the string is narrow or wide.
Yup.
- the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2, all other string operations will have to do similar things.
Is it possible to add a logical length field too? I presume it is too expensive to recalculate the logical (character) length of a string each time len(s) is called? Doing this is only slightly more time consuming than a normal strlen: really just O(n) + c, where 'c' is the constant time needed for table lookup (to get the number of bytes in the UTF-8 sequence given the start character) and the pointer manipulation (to add that length to your span pointer).
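The O(n) character count Tom describes can be sketched in a few lines of Python: UTF-8 continuation bytes all have the form 10xxxxxx, so counting characters is just counting the bytes that are *not* continuation bytes. (Hypothetical helper name; a C version would use a lookup table on the lead byte, as he says.)

```python
# Count the logical (character) length of UTF-8 data in one O(n) pass:
# skip continuation bytes (10xxxxxx) and count only lead bytes.
def utf8_char_count(data: bytes) -> int:
    return sum(1 for b in data if (b & 0xC0) != 0x80)

s = "naïve"                      # 5 characters
encoded = s.encode("utf-8")      # 6 bytes: 'ï' takes two
assert len(encoded) == 6
assert utf8_char_count(encoded) == len(s) == 5
```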
- there can possibly be an encoding attribute which may specify the used encoding, if known.
So is this used to handle the case where you have a legacy encoding (ShiftJIS, say) used in your existing strings, so you flag that 8-bit ("narrow" in a way) string as ShiftJIS? If wide strings are always Unicode, why do you need the encoding?
Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have a narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows __str__() & str() and friends to work with unicode strings without any change.
If you store wide strings as UCS2 then people using the C interface lose: strlen() stops working, or will return incorrect results. Indeed, any of the str*() routines in the C runtime will break. This is the advantage of using UTF-8 here --- you can still use strcpy and the like on the C side and have things work.
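Tom's point can be demonstrated directly: UCS-2 embeds NUL bytes for every character below U+0100, and C's strlen() stops at the first NUL, while UTF-8 never produces NUL bytes for non-NUL characters. A small Python sketch simulating strlen():

```python
# Why UCS-2 breaks C's strlen() but UTF-8 does not.
ucs2 = "abc".encode("utf-16-be")   # b'\x00a\x00b\x00c' - NUL bytes everywhere
utf8 = "abc".encode("utf-8")       # b'abc' - no NUL bytes

# Simulate C's strlen(): count bytes up to the first NUL.
def c_strlen(data: bytes) -> int:
    return data.index(0) if 0 in data else len(data)

assert c_strlen(ucs2) == 0   # strlen() sees an "empty" string
assert c_strlen(utf8) == 3   # the str*() routines keep working on UTF-8
```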
Any thoughts?
I'm doing essentially what you suggest in my Unicode enablement of MySQL. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"

Tom> Is it possible to add a logical length field too? I presume it is
Tom> too expensive to recalculate the logical (character) length of a
Tom> string each time len(s) is called? Doing this is only slightly more
Tom> time consuming than a normal strlen: ...

Note that currently the len() method doesn't call strlen() at all. It just returns the ob_size field. Presumably, with Just's proposal len() would simply return ob_size/width. If you used a variable-width encoding, Just's plan wouldn't work. (I don't know anything about string encodings - is UTF-8 variable width?)

Skip Montanaro writes:
Note that currently the len() method doesn't call strlen() at all. It just returns the ob_size field. Presumably, with Just's proposal len() would simply return ob_size/width. If you used a variable width encoding, Just's plan wouldn't work. (I don't know anything about string encodings - is UTF-8 variable width?)
Yes, technically from 1 - 6 bytes per character, though in practice for Unicode it's 1 - 3. -tree

Today I had a relatively simple idea that unites wide strings and narrow strings in a way that is more backward compatible at the C level. It's quite possible this has already been considered and rejected for reasons that are not yet obvious to me, but I'll give it a shot anyway.
The main concept is not to provide a new string type but to extend the existing string object like so:
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2; all other string operations will have to do similar things.
- there can possibly be an encoding attribute which may specify the used encoding, if known.
Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have a narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows __str__() & str() and friends to work with unicode strings without any change.
This seems to have some nice properties, but I think it would cause problems for existing C code that tries to *interpret* the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes). --Guido van Rossum (home page: http://www.python.org/~guido/)

(Thanks for all the comments. I'll condense my replies into one post.) [JvR]
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
[Tom Emerson wrote]
I disagree with you here... store them as UTF-8.
Erm, utf-8 in a wide string? This makes no sense... [Skip Montanaro]
Presumably, with Just's proposal len() would simply return ob_size/width.
Right. And if you would allow values for width other than 1 and 2, it opens the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and "only" width==1 needs to be special-cased for speed.
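The width-1/2/4 arithmetic Just sketches can be checked with Python's own fixed-width codecs (the codec names here are modern Python's, used purely for illustration):

```python
# len(s) == ob_size // width, for width 1, 2, or 4 (narrow / wide / ultra-wide).
for text, codec, width in [
    ("abc", "latin-1", 1),      # narrow: one byte per character
    ("abc", "utf-16-be", 2),    # wide: two bytes per character (UCS-2)
    ("abc", "utf-32-be", 4),    # ultra-wide: four bytes per character (UCS-4)
]:
    data = text.encode(codec)
    assert len(data) == 3 * width   # physical length (ob_size)
    assert len(data) // width == 3  # logical length
```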
If you used a variable width encoding, Just's plan wouldn't work.
Correct, but neither does the current unicode object. Variable-width encodings are too messy to see as strings at all: they are only useful as byte arrays. [GvR]
This seems to have some nice properties, but I think it would cause problems for existing C code that tries to *interpret* the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes).
There are probably many cases that fall into this category. But then again, these cases, especially those that potentially can deal with other encodings than ascii, are not much helped by a default encoding, as /F showed. My idea arose after yesterday's discussions. Some quotes, plus comments: [GvR]
However the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.)
Guido goes on to explain that this means utf-8 is the only sensible default in this case. Good reasoning, but I think it's backwards:
- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.
[MAL]
BTW, __str__() has to return strings too. Perhaps we need __unicode__() and a corresponding slot function too ?!
This also seems backwards. If it's really too hard to change Python so that __str__ can return unicode objects, my solution may help. [Ka-Ping Yee]
Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term:
print x
does, conceptually:
try:
    sys.stdout.printout(x)
except AttributeError:
    sys.stdout.write(str(x))
    sys.stdout.write("\n")
That stuff like this is even being *proposed* (not that it's not smart or anything...) means there's a terrible bottleneck somewhere which needs fixing. My proposal seems to do that nicely. Of course, there's no such thing as a free lunch, and I'm sure there are other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc)) ...

in all places that may accept unicode strings is no fun either. Yes, some code will break if you throw a wide string at it, but I think that code is easier repaired with my proposal than with the current implementation. It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about.

Just

I wrote:
It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about.
I think I should've been more explicit about what I meant here. I'll try to phrase it as an addendum to my proposal -- which suddenly is no longer just a narrow/wide string unification but narrow/wide/ultra-wide, to really be ready for the future...

As someone else suggested in the discussion, I think it's good if we separate the encoding from the data type, meaning that wide strings are no longer tied to Unicode. This allows for double-byte encodings other than UCS-2 as well as for safe passing-through of binary goop, but that's not the main point. The main point is that this will make the behavior of (wide) strings more understandable and consistent.

The extended string type is simply a sequence of code points, allowing for 0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for ultra-wide strings. Upcasting is always safe; downcasting may raise OverflowError. Depending on the used encoding, this comes as close as possible to the sequence-of-characters model. The default character set should of course be Unicode -- and it should be obvious that this implies Latin-1 for narrow strings. (Additionally: an encoding attribute suddenly makes a whole lot of sense again.)

Ok, y'all can shoot me now ;-)

Just
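The sequence-of-code-points model above can be sketched in a few lines: each width implies a maximum code point, upcasting always fits, and downcasting past the limit fails. The helper names are hypothetical, invented for this sketch.

```python
# Sketch of the narrow/wide/ultra-wide code-point model: a width implies a
# maximum code point; downcasting past it raises OverflowError.
LIMITS = {1: 0xFF, 2: 0xFFFF, 4: 0xFFFFFFFF}

def downcast(codepoints, width):
    if any(cp > LIMITS[width] for cp in codepoints):
        raise OverflowError("code point does not fit in width %d" % width)
    return list(codepoints)

narrow = [ord(c) for c in "café"]      # all code points <= 0xFF: Latin-1 range
assert downcast(narrow, 1) == narrow   # fits in a narrow string
try:
    downcast([0x1F40D], 2)             # > 0xFFFF: needs an ultra-wide string
except OverflowError:
    pass
else:
    raise AssertionError("expected OverflowError")
```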
participants (4)
-
Guido van Rossum
-
Just van Rossum
-
Skip Montanaro
-
Tom Emerson