[Python-Dev] Internationalization Toolkit

Tim Peters tim_one@email.msn.com
Thu, 11 Nov 1999 01:49:16 -0500


[Greg Stein]
> ...
> Things will be a lot faster if we have a fixed-size character. Variable
> length formats like UTF-8 are a lot harder to slice, search, etc.

The initial byte of any UTF-8 encoded character never appears in a
*non*-initial position of any UTF-8 encoded character.  Which means
searching is not only tractable in UTF-8, but also that whatever optimized
8-bit clean string searching routines you happen to have sitting around
today can be used as-is on UTF-8 encoded strings.  This is not true of UCS-2
encoded strings (in which "the first" byte is not distinguished, so 8-bit
search is vulnerable to finding a hit starting "in the middle" of a
character).  More, to the extent that the bulk of your text is plain ASCII,
the UTF-8 search will run much faster than when using a 2-byte encoding,
simply because it has half as many bytes to chew over.
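
A concrete sketch of the point (present-day Python 3 str/bytes used purely
for illustration; the strings and names here are just made-up examples):

    # UTF-8: a byte-level search cannot begin a match inside a character,
    # because continuation bytes (0x80..0xBF) never equal any lead byte.
    hay = "na\u00efve caf\u00e9".encode("utf-8")
    hit = hay.find("\u00e9".encode("utf-8"))
    assert hay[:hit].decode("utf-8") == "na\u00efve caf"   # hit lands on a boundary

    # UCS-2 (little-endian; UTF-16-LE coincides with it for these
    # characters): the same byte-level search can land in the middle of a
    # character.  U+6100 followed by U+0100 encodes as 00 61 00 01, so
    # searching for "a" (61 00) "succeeds" at byte offset 1 -- a bogus hit
    # straddling two characters.
    hay2 = "\u6100\u0100".encode("utf-16-le")
    assert "a" not in "\u6100\u0100"                   # no real 'a' in the text
    assert hay2.find("a".encode("utf-16-le")) == 1     # but the byte search finds one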

UTF-8 is certainly slower for random-access indexing, including slicing.
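
To see why (a sketch only, not anything proposed for the implementation):
finding the i'th character in UTF-8 means walking the bytes and skipping
continuation bytes, while a fixed-size encoding can jump straight to byte
2*i.

    def utf8_byte_offset(data, i):
        """Byte offset of the i'th character in UTF-8 `data` -- O(n)."""
        count = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:        # not a continuation byte => new character
                if count == i:
                    return offset
                count += 1
        raise IndexError("character index out of range")

    s = "na\u00efve".encode("utf-8")       # b'na\xc3\xafve'
    assert utf8_byte_offset(s, 2) == 2     # the two-byte \u00ef starts here
    assert utf8_byte_offset(s, 3) == 4     # 'v' sits after it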

I don't know what "etc" means, but if it follows the pattern so far,
sometimes it's faster and sometimes it's slower <wink>.

> (IMO) a big reason for this new type is for interaction with the
> underlying OS/platform. I don't know of any platforms right now that
> really use UTF-8 as their Unicode string representation (meaning we'd
> have to convert back/forth from our UTF-8 representation to talk to the
> OS).

No argument here.