
Just van Rossum writes:
The main concept is not to provide a new string type but to extend the existing string object like so:
This is the most logical thing to do.
- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
I disagree with you here... store them as UTF-8.
- there's a flag that specifies whether the string is narrow or wide.
Yup.
- the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2, all other string operations will have to do similar things.
Is it possible to add a logical length field too? I presume it is too expensive to recalculate the logical (character) length of a string each time len(s) is called. Doing so is only slightly more time-consuming than a normal strlen(): it is still O(n), where each step costs a constant-time table lookup (to get the number of bytes in the UTF-8 sequence from its start byte) plus the pointer manipulation (to add that length to your span pointer).
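To make the table-driven count concrete, here is a minimal sketch, written in Python for brevity rather than in the C the interpreter would use. The names SEQ_LEN and utf8_char_len are made up for illustration, and the sketch assumes the buffer is well-formed UTF-8 (it does not check continuation bytes or reject overlong forms).

```python
# Table indexed by lead byte: number of bytes in the UTF-8 sequence it starts.
# 0 marks bytes that cannot start a sequence (continuation bytes 0x80-0xBF
# and the never-valid bytes 0xF8-0xFF).
SEQ_LEN = (
    [1] * 0x80      # 0x00-0x7F: single-byte (ASCII)
    + [0] * 0x40    # 0x80-0xBF: continuation bytes
    + [2] * 0x20    # 0xC0-0xDF: two-byte sequences
    + [3] * 0x10    # 0xE0-0xEF: three-byte sequences
    + [4] * 0x08    # 0xF0-0xF7: four-byte sequences
    + [0] * 0x08    # 0xF8-0xFF: not valid lead bytes
)


def utf8_char_len(data: bytes) -> int:
    """Return the logical (character) length of a UTF-8 byte string.

    Assumes well-formed input; only the lead byte of each sequence is read.
    """
    count = 0
    i = 0
    n = len(data)
    while i < n:
        step = SEQ_LEN[data[i]]   # table lookup on the start byte
        if step == 0:
            raise ValueError(f"byte at offset {i} cannot start a sequence")
        i += step                 # advance the span "pointer"
        count += 1
    return count
```

For example, utf8_char_len("héllo €".encode("utf-8")) walks a 10-byte buffer in 7 steps. Caching that result in a logical length field would avoid the walk on every len(s).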
- there can possibly be an encoding attribute which may specify the used encoding, if known.
So is this used to handle the case where you have a legacy encoding (ShiftJIS, say) used in your existing strings, so you flag that 8-bit ("narrow" in a way) string as ShiftJIS? If wide strings are always Unicode, why do you need the encoding?
Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows for __str__() & str() and friends to work with unicode strings without any change.
If you store wide strings as UCS-2 then people using the C interface lose: strlen() stops working, or returns incorrect results, because every character in the ASCII range carries a zero byte. Indeed, any of the str*() routines in the C runtime will break. This is the advantage of using UTF-8 here: it never produces a zero byte except for U+0000 itself, so you can still use strcpy and the like on the C side and have things work.
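A small illustration of the difference, again sketched in Python. The helper c_strlen is hypothetical and only mimics what C's strlen() does (scan for the first zero byte); "utf-16-le" stands in for UCS-2 here, which is equivalent for characters in the Basic Multilingual Plane.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): count bytes up to the first zero byte."""
    return buf.index(0)


text = "abc"

# UCS-2 (little-endian) with a two-byte terminator: 61 00 62 00 63 00 00 00
ucs2 = text.encode("utf-16-le") + b"\x00\x00"

# UTF-8 with the usual one-byte terminator: 61 62 63 00
utf8 = text.encode("utf-8") + b"\x00"

ucs2_len = c_strlen(ucs2)   # stops at the high byte of 'a'
utf8_len = c_strlen(utf8)   # sees the whole string

# Non-ASCII text in UTF-8 still contains no zero bytes, so strlen() returns
# the full physical (byte) length, though not the character count.
kanji = "日本語".encode("utf-8") + b"\x00"
kanji_len = c_strlen(kanji)
```

Here ucs2_len is 1 while utf8_len is 3, which is the breakage described above. For the non-ASCII case strlen() gives the byte length, so the logical length still has to come from somewhere else.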
Any thoughts?
I'm doing essentially what you suggest in my Unicode enablement of MySQL.

-tree

-- 
Tom Emerson
Basis Technology Corp.
Language Hacker
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"