[I18n-sig] Re: How does Python Unicode treat surrogates?

Tim Peters <tim.one@home.com>
Mon, 25 Jun 2001 23:52:24 -0400


[Guido]
> But UTF-16 vs. UCS-4 is not an implementation detail!

[Gaute B Strokkenes]
> Sure it is!  A given chunk of Unicode data is semantically just a
> finite sequence of Unicode scalar values.  The difference between
> UTF-16 and UCS-4 is entirely one of how you are arranging bits and
> bytes to store the same information.  The meaning is exactly the same;
> so it's an implementation detail.

I don't know what possessed Guido to make that claim, but I'm sure he'll
agree after some thought (he must, because you're right <wink>).
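
To make the point concrete, here's a minimal sketch (in modern Python
spelling -- the utf-32 codec, for one, arrived in a later Python, so
read it as illustration, not as what the interpreter of this thread
offered):  the same scalar values survive a round trip through either
encoding form, and only the byte counts differ.

    s = "A\U00010000B"                 # three Unicode scalar values
    assert s.encode("utf-16-be").decode("utf-16-be") == s
    assert s.encode("utf-32-be").decode("utf-32-be") == s
    print(len(s.encode("utf-16-be")))  # 8 bytes: 2 + 4 + 2
    print(len(s.encode("utf-32-be")))  # 12 bytes: 4 per scalar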

Something else is bothering me here, though:  Python isn't C, or even Java,
so a slicing gimmick favored by Unicode *implementors*, one that returns
raw encoding bytes (call 'em octets if you must, but they're bytes to me
<wink>), is at the wrong level.  Unicode *users* can't paste this crap
together again efficiently in Python code, because high-volume low-level
bit-fiddling is exactly what Python code is worst at.  So the idea that
u[i] (for a Unicode string u and int i) should ever return meaningless
binary blobs at the *Python* level is just astonishing to me:  Unicode
strings in Python are an end-user feature, not a low-level crutch for
Unicode library developers.
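
And for the record, a hedged sketch (same modern-Python caveat) of the
failure mode:  if u[i] handed back UTF-16 code units, one non-BMP
character would surface as two lone surrogates, and pasting the scalar
back together is exactly the bit-fiddling pure Python is worst at.

    import struct

    u = "\U00010000"                    # one scalar value, U+10000
    # The UTF-16 code units a unit-based u[i] would expose:
    units = struct.unpack(">2H", u.encode("utf-16-be"))
    print([hex(n) for n in units])      # ['0xd800', '0xdc00']
    # Reassembling the scalar by hand, per the surrogate-pair formula:
    high, low = units
    scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    assert chr(scalar) == u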