[I18n-sig] Re: Unicode surrogates: just say no!
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 19:06:30 +0200
> "Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:
>
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1. Where exactly does it say that processing
> > a six-byte two-surrogates sequence as a single character is
> > non-conforming?
>
> It's not non-conforming, it's "irregular".
If an implementation processes something, it is either conforming or
non-conforming in doing so, no? The byte sequence itself may be
irregular; I'm asking how a conforming implementation should deal with
such a sequence when it sees one.
> Please read the technical report (#27) that I pointed at yesterday
> (on the i18n-sig@python). It gives detailed specifications for
> UTF-8. Anything not in the table "UTF-8 Bit Distribution" and
> accompanying description shown there is non-conforming.
I see conformant/non-conformant (*) used only for implementations (and
processes), not for byte sequences. For byte sequences you use illegal,
ill-formed, and irregular; much of my confusion probably arises because
I don't know how these terms relate, except that
- an irregular sequence (of bytes, or code units) is not illegal.
Also, I assume that negation of these concepts follows the rules of
English (i.e. "not illegal" == "legal", "not ill-formed" ==
"well-formed", etc.).
> In other words, it is non-conforming to generate two 3-byte things for a
> surrogate pair. However, it remains "legal but irregular" to interpret
> such a pair of 3-byte entities.
[...]
> If you still find the definitions and discussion in the technical report
> to be unclear, then the Unicode editorial committee would undoubtedly like
> to hear about it.
The issue of UTF-8 encoded surrogate pairs is now clear to me, I hope:
You must not write them, but you may read them.
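
The write/read asymmetry can be illustrated with a minimal sketch,
assuming a modern Python 3 interpreter. The helper names write_utf8 and
read_utf8_lenient are hypothetical, and the "surrogatepass" error
handler is only one possible way to build a lenient reader; Python's
default decoder is stricter than the reading described here and
rejects the six-byte form outright.

```python
def write_utf8(text: str) -> bytes:
    """Encode strictly: a supplementary character becomes one 4-byte sequence."""
    return text.encode("utf-8")


def read_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8, also accepting a surrogate pair written as two 3-byte sequences."""
    # "surrogatepass" lets the decoder accept 3-byte encodings of U+D800..U+DFFF.
    code_units = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 merges each high/low surrogate pair
    # into the single supplementary character it stands for.
    return code_units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


irregular = b"\xed\xa0\x80\xed\xb0\x80"  # U+D800 U+DC00, three bytes each
regular = b"\xf0\x90\x80\x80"            # U+10000 as one 4-byte sequence

# Reading: both forms yield the same single character.
assert read_utf8_lenient(irregular) == "\U00010000"
assert read_utf8_lenient(regular) == "\U00010000"
# Writing: only the 4-byte form is ever produced.
assert write_utf8("\U00010000") == regular
```

Note that read_utf8_lenient raises UnicodeDecodeError on an unpaired
surrogate, so it says nothing yet about lone surrogate triplets.
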
The next question then is what to do with lone surrogate triplets; the
table in TR 27 suggests they are legal, but people on this list have
argued they must neither be emitted nor consumed (since what you get
is not a legal USV).
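
The "neither emitted nor consumed" position can be sketched as follows,
again assuming modern Python 3. The helper names are hypothetical, and
the strict decoder's behaviour shown here reflects a choice made by
current Python, not a ruling on what TR 27 itself requires.

```python
def is_scalar_value(code_point: int) -> bool:
    """True if code_point is a Unicode scalar value, i.e. not a surrogate."""
    return 0 <= code_point <= 0x10FFFF and not 0xD800 <= code_point <= 0xDFFF


def strict_decoder_accepts(data: bytes) -> bool:
    """Report whether Python's default (strict) UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_high = b"\xed\xa0\x80"  # three-byte form of the lone surrogate U+D800

assert not is_scalar_value(0xD800)            # a surrogate is not a scalar value
assert is_scalar_value(0x10000)               # a supplementary character is
assert not strict_decoder_accepts(lone_high)  # the strict decoder refuses to consume it
```
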
Thanks for your comments,
Martin
(*) "Conforming" is never used; sorry for the confusion.