[I18n-sig] How does Python Unicode treat surrogates?
Tom Emerson
tree@basistech.com
Mon, 25 Jun 2001 13:13:56 -0400
Guido van Rossum writes:
> I'm sorry, but I don't see why it's UCS-2 any more or less than
> UTF-16. That's like arguing whether 8-bit strings contains ASCII or
> UTF-8. That's up to the application; the data type can be used for
> either.
UCS-2, UTF-16, and UTF-8 are encoding forms of Unicode. Unicode
defines characters using an abstract integer, the code point. As of
Unicode 3.1, code points range from 0x000000 to 0x10FFFF.
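
To make the code point versus encoding form distinction concrete, here is
a small sketch. It assumes a modern Python 3 interpreter (not the 2.x
builds discussed in this thread), and the choice of U+1D11E as the sample
character is mine. The same abstract integer comes out as a different
sequence of bytes under each encoding form:

```python
# One code point outside the 16-bit range: U+1D11E (MUSICAL SYMBOL G CLEF).
# Assumption: Python 3, where chr() accepts any code point up to 0x10FFFF.
code_point = 0x1D11E
ch = chr(code_point)

utf8 = ch.encode("utf-8")       # four 8-bit code units
utf16 = ch.encode("utf-16-be")  # two 16-bit code units (a surrogate pair)
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf8.hex(), utf16.hex(), utf32.hex())
```

Note that UTF-16 needs two 16-bit units here: the pair 0xD834, 0xDD1E.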
The so-called Unicode string type in Python is a wide-string type,
where each character is treated as a 16-bit quantity. The
interpretation placed on those 16-bit quantities is that of UCS-2. In
that case each half of a surrogate pair is an unknown character.
As soon as you impose UTF-16 semantics on the 16-bit quantities, then
you need to treat surrogate pairs as a single character.
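
For reference, treating a pair as a single character comes down to the
arithmetic below. This is only a sketch in Python; the function names are
mine, not anything in the Python source:

```python
def is_high_surrogate(unit):
    """True if a 16-bit unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True if a 16-bit unit is a low (trailing) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def combine_surrogates(high, low):
    """Map a valid surrogate pair to the code point it encodes."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    # Each half contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Under UCS-2 semantics the two units 0xD834 and 0xDD1E are two unknown
characters; under UTF-16 semantics combine_surrogates(0xD834, 0xDD1E)
gives the single code point 0x1D11E.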
If the implementation won't change, then the standard library needs to
support surrogates as a wrapper: leaving it up to each application is
a mistake. IMHO you cannot trust implementers to do this right.
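
As a rough idea of what such a wrapper might look like, here is a sketch
of an iterator that walks a sequence of 16-bit units and yields code
points. The name iter_code_points and the decision to pass unpaired
surrogates through unchanged are my assumptions, not an existing library
API:

```python
def iter_code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    code point. An unpaired surrogate is yielded as-is (a hypothetical
    policy; a real library might raise an error instead).
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1
```

With this, list(iter_code_points([0x41, 0xD834, 0xDD1E, 0x42])) produces
three code points rather than four 16-bit units.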
> But unless I misunderstand what it *is* that you are suggesting, the
> O(1) indexing property can't be retained with your suggestion, and
> that's out of the question.
You understand me completely. Adding transparent UTF-16 support
changes your O(1) indexing operation to O(1+c), where 'c' is the small
amount of time required to check for the surrogate. Granted, this 'c'
could get large, but...
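
One way to see how 'c' can grow: if an index is meant to count characters
rather than 16-bit units, every pair before the index shifts the position,
so a naive lookup has to scan from the start. The sketch below assumes
that naive approach with no auxiliary index table; code_point_at is a
hypothetical name:

```python
def code_point_at(units, index):
    """Return the index-th code point of a sequence of 16-bit units.

    Naive linear scan: the cost grows with the index, because every
    surrogate pair before it shifts the position of later characters.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    i = 0      # position in 16-bit units
    seen = 0   # number of code points passed so far
    n = len(units)
    while i < n:
        unit = units[i]
        paired = (0xD800 <= unit <= 0xDBFF and i + 1 < n
                  and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if seen == index:
            if paired:
                low = units[i + 1]
                return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            return unit
        i += 2 if paired else 1
        seen += 1
    raise IndexError("code point index out of range")
```

With a fixed-width 32-bit unit the same lookup is a plain array access.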
But I see your point: this requirement is what prompted the glibc
folks to go with the 32-bit wchar_t type.
> That turned out to be a myth, actually. mod_python works fine with
> threads on most platforms.
Not in my experience. On my FreeBSD box, Python 2.0 built with threads
does not get along in some cases with Apache 1.3.19. Not that it matters.
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"