[I18n-sig] Re: How does Python Unicode treat surrogates?
Tim Peters
tim.one@home.com
Mon, 25 Jun 2001 23:52:24 -0400
[Guido]
> But UTF-16 vs. UCS-4 is not an implementation detail!
[Gaute B Strokkenes]
> Sure it is! A given chunk of Unicode data is semantically just a
> finite sequence of Unicode scalar values. The difference between
> UTF-16 and UCS-4 is entirely one of how you are arranging bits and
> bytes to store the same information. The meaning is exactly the same,
> so it's an implementation detail.
I don't know what possessed Guido to make that claim, but I'm sure he'll
agree after some thought (he must, because you're right <wink>).
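
To make that concrete, here's a minimal sketch of one scalar value stored
two different ways (the spelling is later-Python, with bytes.hex() and
friends the 2.1 of this thread lacks, but the arithmetic is the same):

    u = u"\U00010000"   # one Unicode scalar value, just past the BMP

    # UTF-16: two 16-bit code units, a surrogate pair.
    print(u.encode("utf-16-be").hex())   # d800dc00

    # UCS-4 / UTF-32: one 32-bit code unit.
    print(u.encode("utf-32-be").hex())   # 00010000

Same scalar, same meaning, different bits: an implementation detail if
anything ever was.
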
Something else is bothering me here, though: Python isn't C, or even Java,
so a slicing gimmick favored by Unicode *implementors*, one that returns
raw encoding bytes (call 'em octets if you must, but they're bytes to me
<wink>), is at the wrong level. Unicode *users* can't paste this crap
together again
efficiently using Python code, because high-volume low-level bit-fiddling is
exactly what Python code is worst at. So the idea that u[i] (for a Unicode
string u and int i) should ever return meaningless binary blobs at the
*Python* level is just astonishing to me: Unicode strings in Python are an
end-user feature, not a low-level crutch for Unicode library developers.
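
For the record, here's what one of those blobs looks like, assuming a
narrow (UTF-16) build like the default one; a wide (UCS-4) build never
hands out surrogate halves from indexing:

    u = u"\U00010000"   # one scalar value, outside the BMP

    len(u)    # 2 on a narrow build (1 on a wide build)
    u[0]      # u'\ud800' on a narrow build: a lone high surrogate,
              # which is not a Unicode scalar value at all

    # Pasting the halves back together is exactly the high-volume
    # low-level bit-fiddling Python code is worst at:
    hi, lo = ord(u[0]), ord(u[1])
    scalar = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    hex(scalar)   # '0x10000' (recovered, but at what cost <wink>)
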