[Python-Dev] [I18n-sig] Unicode strings: an alternative
Tom Emerson
tree@basistech.com
Wed, 3 May 2000 16:19:05 -0400 (EDT)
Just van Rossum writes:
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
This is the most logical thing to do.
> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.
I disagree with you here... store them as UTF-8.
> - there's a flag that specifies whether the string is narrow or wide.
Yup.
> - the ob_size field is the _physical_ length of the data; if the string is
> wide, len(s) will return ob_size/2, all other string operations will have
> to do similar things.
Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: it is still O(n), with a small constant
cost per character for the table lookup (to get the number of bytes in
the UTF-8 sequence given the start byte) and the pointer
manipulation (to add that length to your span pointer).
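
As a rough illustration of the table-driven length scan described above, here is a minimal sketch in Python. The names `SEQ_LEN` and `utf8_char_count` are hypothetical, the table assumes the modern 4-byte limit on UTF-8 sequences, and no validation of malformed input is attempted.

```python
def _sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1          # ASCII
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 1              # stray continuation or invalid byte: skip one byte


# 256-entry lookup table, built once.
SEQ_LEN = [_sequence_length(b) for b in range(256)]


def utf8_char_count(data: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    n = len(data)
    while i < n:
        i += SEQ_LEN[data[i]]   # table lookup, then advance the span index
        count += 1
    return count
```

The scan visits each character once, so it stays linear in the length of the data; caching the result in a logical length field would avoid repeating it on every len(s) call.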
> - there can possibly be an encoding attribute which may specify the used
> encoding, if known.
So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?
If wide strings are always Unicode, why do you need the encoding?
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.
If you store wide strings as UCS-2 then people using the C interface
lose: strlen() stops working, or will return incorrect results,
because UCS-2 data contains embedded zero bytes (every ASCII character
carries one). Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here: you can still use
strcpy and the like on the C side and have things work.
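
The difference can be observed from Python with the standard ctypes module. This is only a sketch: the helper name `c_string_length` is hypothetical, and UTF-16-LE is used as a stand-in for UCS-2, which is equivalent for characters in the Basic Multilingual Plane.

```python
import ctypes


def c_string_length(buf: bytes) -> int:
    """Length as a C str*() routine sees it: bytes before the first NUL."""
    return len(ctypes.c_char_p(buf).value)


text = "abc"
ucs2 = text.encode("utf-16-le")   # b'a\x00b\x00c\x00': a zero byte after each letter
utf8 = text.encode("utf-8")       # b'abc': no embedded zero bytes

ucs2_seen_by_c = c_string_length(ucs2)   # stops at the first zero byte
utf8_seen_by_c = c_string_length(utf8)   # sees the whole string
```

Here the UCS-2 buffer appears to C as a one-byte string, while the UTF-8 buffer is seen in full. Note that for non-ASCII text the UTF-8 result is a byte count, not a character count, so C code that passes strings along keeps working but code that counts characters still needs care.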
> Any thoughts?
I'm doing essentially what you suggest in my Unicode enablement of MySQL.
-tree
--
Tom Emerson Basis Technology Corp.
Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"