[I18n-sig] Re: How does Python Unicode treat surrogates?

Gaute B Strokkenes gs234@cam.ac.uk
26 Jun 2001 04:06:07 +0100


On Mon, 25 Jun 2001, JMachin@Colonial.com.au wrote:
> MAL and Gaute,
> 
> Can I please take the middle ground (and risk having both of you
> throw things at me)?
> 
> => Lone surrogates are not 'true Unicode char points
>  in their own right' [MAL] -- they don't represent characters. 

I think you're misquoting MAL; the "not" was not there in his original
statement.

> On the other hand, UTF code sequences that would decode into lone
> surrogates are not "illegal".  Please read clause D29 in section 3.8
> of the Unicode 3.0 standard. This is further clarified by Unicode
> 3.1 which expressly lists legal UTF-8 sequences; these encompass
> lone surrogates.

This is really a different issue.  The paragraph states that the
various UTFs have the property that they can transform any sequence of
scalar values in the range 0 - 0x10FFFF to whatever representation is
mandated by the UTF and then back again in a bijective fashion--even
when the sequence includes scalars that are not Unicode characters,
such as 0xFFFF, 0xFFFE and the various values that are reserved to
contain UTF-16 surrogates.  Personally, I'm having difficulty seeing
how this statement could possibly apply to UTF-16.  (For instance, I
don't see how it would be possible to encode a sequence of two
Unicode scalar values corresponding to a high surrogate followed by
a low surrogate: the resulting byte sequence would be
indistinguishable from the UTF-16 encoding of a single non-BMP code
point, so if you tried to map it back you would get a single Unicode
scalar value outside the BMP, not the original two surrogates.)
Perhaps someone on the unicode list could elaborate?
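The collapse I'm describing is easy to demonstrate.  Here is a small
sketch in modern Python, under the assumption that the codec is
allowed to emit lone surrogates at all (Python's 'surrogatepass'
error handler is an implementation detail of Python's codecs, not
anything the standard mandates):

```python
# The scalar sequence <U+D800, U+DC00> (high surrogate, low
# surrogate, as two separate scalar values) and the single scalar
# U+10000 encode to *identical* UTF-16 bytes -- so the encoding
# cannot be bijective over sequences containing surrogate scalars.

pair = '\ud800\udc00'      # two scalars: high then low surrogate
single = '\U00010000'      # one scalar outside the BMP

# 'surrogatepass' is needed because Python otherwise refuses to
# encode lone surrogate scalars.
encoded_pair = pair.encode('utf-16-le', 'surrogatepass')
encoded_single = single.encode('utf-16-le')

print(encoded_pair == encoded_single)   # True: same four bytes
roundtrip = encoded_pair.decode('utf-16-le')
print(len(roundtrip))                   # 1 -- the two scalars came
print(roundtrip == single)              # back as one code point
```

So mapping the two-surrogate sequence through UTF-16 and back yields
one non-BMP scalar, not the original two -- which is exactly why I
don't see how the bijectivity claim can hold for UTF-16.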

My personal theory is that this is a vestige of the days when
"Unicode" meant "16-bit characters" and all UTFs other than UTF-16
were just hacks that one was supposed to use for compatibility reasons
only.  Eventually someone realised that 16 bits wasn't going to be
enough after all, and so kludges like surrogates were invented.  It is
instructive in this regard to note how the Unicode 3.0 conformance
requirements effectively state that "thou shalt use 16-bit
characters"; the paragraph stating that using UCS-4 for the wchar_t
type in ISO C (this is what glibc does) is not Unicode conformant is
particularly amusing.  This was all changed for 3.1.

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
..  here I am in 53 B.C. and all I want is a dill pickle!!