[I18n-sig] Re: How does Python Unicode treat surrogates?
Gaute B Strokkenes
gs234@cam.ac.uk
26 Jun 2001 04:24:27 +0100
On Mon, 25 Jun 2001, guido@digicool.com wrote:
>> No problem... we can change to 4 byte values too if the world
>> agrees on 4 bytes per character. However, 2 bytes or 4 bytes
>> is an implementation detail and not part of the Unicode standard
>> itself.
>
> But UTF-16 vs. UCS-4 is not an implementation detail!
Sure it is! A given chunk of Unicode data is semantically just a
finite sequence of Unicode scalar values. The difference between
UTF-16 and UCS-4 is entirely one of how you arrange bits and
bytes to store the same information. The meaning is exactly the same,
so it's an implementation detail.
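
As a rough illustration of the point about scalar values, here is a
minimal sketch. It assumes a modern Python 3 interpreter (where str
holds code points) and uses the standard "utf-16-be" and "utf-32-be"
codecs, with UTF-32 standing in for UCS-4. The sample string and the
variable names are made up for the example.

```python
# Three scalar values: U+0041, U+00E9 and U+10400 (outside the BMP).
text = "A\u00e9\U00010400"

# Two different byte layouts for the same data.
utf16 = text.encode("utf-16-be")   # 2 + 2 + 4 bytes (last one is a surrogate pair)
utf32 = text.encode("utf-32-be")   # 4 + 4 + 4 bytes

# Decoding either layout yields the same sequence of scalar values.
scalars_from_utf16 = [ord(c) for c in utf16.decode("utf-16-be")]
scalars_from_utf32 = [ord(c) for c in utf32.decode("utf-32-be")]

print(scalars_from_utf16 == scalars_from_utf32)
print(len(utf16), len(utf32))
```

The byte strings differ in length and content, but the decoded
sequences compare equal, which is the sense in which the choice is a
storage detail.
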
A (somewhat far-fetched, but there you are) analogy is this: imagine
that you wish to store a true-colour bitmap in memory. You could do
this by, say, storing the R, G and B components of a given pixel right
next to each other, in that order. Alternatively, you could keep all
the R components in one chunk and all the G components in another, or
you could store the pixels in a different order. All of this makes no
difference to the actual bitmap itself.
I hope you see what I mean.
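
To make the bitmap analogy concrete, here is a small sketch in Python.
The two-pixel image and the helper names (from_interleaved,
from_planar) are hypothetical, chosen only for illustration.

```python
# A hypothetical 2-pixel image; each pixel is an (R, G, B) tuple.
pixels = [(255, 0, 0), (0, 128, 255)]

# Layout 1: interleaved, i.e. R G B R G B ...
interleaved = bytes(c for px in pixels for c in px)

# Layout 2: planar, i.e. all R values, then all G values, then all B values.
planar = (bytes(px[0] for px in pixels)
          + bytes(px[1] for px in pixels)
          + bytes(px[2] for px in pixels))

def from_interleaved(data):
    """Rebuild the pixel list from the interleaved layout."""
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]

def from_planar(data):
    """Rebuild the pixel list from the planar layout."""
    n = len(data) // 3
    return list(zip(data[:n], data[n:2 * n], data[2 * n:]))

# Different bytes in memory, same bitmap.
print(interleaved != planar)
print(from_interleaved(interleaved) == from_planar(planar) == pixels)
```
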
> If we store 4 bytes per character, we should treat surrogates
> differently. I don't know where those would be converted --
> probably in the UTF-16 to UCS-4 codec.
An important point here is that the sole raison d'etre of surrogates
is to let one store the entire 21-bit Unicode code space within the
confines of an encoding built from 16-bit units. If you're not dealing
with UTF-16, surrogates quite simply do not exist, and the only time
you have to worry about them is when you convert to or from UTF-16.
As such, the statement "we should treat surrogates differently when
storing four bytes per character" is rather imprecise: the whole point
is that you don't treat or worry about surrogates at all, except
during conversion to and from UTF-16, obviously.
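
For reference, the conversion step itself is a little arithmetic. The
sketch below follows the UTF-16 definition (subtract 0x10000, then
split the remaining 20 bits into two 10-bit halves). The function
names to_surrogates and from_surrogates are hypothetical and are not
part of any Python API.

```python
def to_surrogates(cp):
    """Split a scalar value above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10400)
print(hex(high), hex(low))               # 0xd801 0xdc00
print(hex(from_surrogates(high, low)))   # 0x10400
```

Only a UTF-16 encoder or decoder needs these two functions; code that
works on scalar values never sees the pair.
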
--
Big Gaute http://www.srcf.ucam.org/~gs234/
I have nostalgia for the late Sixties! In 1969 I left my laundry with
a hippie!! During an unauthorized Tupperware party it was chopped &
diced!