[I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

Kenneth Whistler kenw@sybase.com
Wed, 27 Jun 2001 12:23:47 -0700 (PDT)


Mark Davis wrote:

> You are correct in that the text is not nearly as clear as it should be,
> and is open to different interpretations. My view of the status in Unicode
> 3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm.
> Corresponding computations are on
> http://www.macchiato.com/utc/utf_computations.htm.

I concur in general with Mark's characterization of what the current
text is intended to say. In particular, Mark is correct that there
is language just below D29 that says that "a UTF mapping *must also*
map invalid Unicode scalar values to unique code value sequences. These
invalid scalar values include FFFE, FFFF, and unpaired surrogates."
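
As a rough illustration of that reading (a sketch, not text from the
standard), the Python snippet below uses the standard codecs with the
`surrogatepass` error handler to mimic a UTF-8 mapping that gives each of
these values its own code value sequence. The helper name `roundtrip_utf8`
is hypothetical and exists only for this example.

```python
def roundtrip_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8, letting surrogates through,
    and check that decoding the bytes gives the same code point back."""
    ch = chr(cp)
    # "surrogatepass" lets the codec encode/decode surrogate code points
    # instead of raising an error; this models the lenient reading only.
    data = ch.encode("utf-8", "surrogatepass")
    assert data.decode("utf-8", "surrogatepass") == ch
    return data


if __name__ == "__main__":
    for cp in (0xFFFE, 0xFFFF, 0xD800):
        print(f"U+{cp:04X} -> {roundtrip_utf8(cp).hex()}")
```

Each of the three values gets a distinct three-byte sequence, which is
all the quoted sentence requires of a UTF mapping.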

I strongly agree with Mark that this is the correct position to
take with respect to the *noncharacters*, i.e. FFFE, FFFF (and their
ilk on the supplementary planes, as well as the newly defined
FDD0..FDEF). In this respect, ISO/IEC 10646 is inconsistent in
its definition of UTF-8, and needs to be fixed.
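
For concreteness, the noncharacter set can be sketched as a small
predicate. This is an illustration only: the function name
`is_noncharacter` is made up for the example, and the contiguous block is
taken to be U+FDD0..U+FDEF, as defined in Unicode 3.1.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a Unicode noncharacter code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # The last two code points of every plane: U+xxFFFE and U+xxFFFF.
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    # The contiguous block of 32 noncharacters in the BMP.
    return 0xFDD0 <= cp <= 0xFDEF


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)  # 34 plane-final code points + 32 in the BMP block
```

Python's own UTF-8 codec encodes these code points without complaint,
which matches the position that a UTF must be able to carry them.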

However, like Gaute, I think there are logical contradictions in the
current text of the Unicode Standard when it comes to dealing with the
isolated surrogate code points.

Gaute is also correct that much of the problem of textual interpretation
results from the incomplete transition in Unicode 3.0 from thinking of
UTF-16 as Unicode, with UTF-8 derived from UTF-16, to UTF-16 and UTF-8 as
coequal transforms from the Unicode Scalar Value. The UTC editorial
committee struggled with that text, but also attempted to minimize
the overall impact on Chapter 3 of the standard. In retrospect, it
probably would have been better to take the hit then and completely
rewrite Chapter 3 in terms of the new model, because of the continuing
confusion that the incomplete transition has obviously engendered among
implementers.

> 
> One of the goals for Unicode 4.0 is to clear up the text describing UTFs in
> particular, which may change some of the edge cases (isolates and/or
> irregulars).

This work is actively underway. I can guarantee that the Unicode 4.0
text will be *much* clearer about all these issues.

However, the UTC editorial committee is still struggling with exactly
how to present the edge cases.

It is my *personal* opinion -- and not yet one that could be stated
to be consensus in UTC or the UTC editorial committee -- that
the Unicode Standard should adopt formal definitions similar to
those of the IETF, where isolated surrogates and/or irregular sequences
are just ill-formed, period, and where the issues of lenient interpretation
of irregular UTF-8 generated by older implementations are shunted
off into a migration strategy section dealing with UTF converters.
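
A minimal sketch of how that split might look in practice, using Python's
standard codecs: a strict decoder that rejects surrogates encoded in UTF-8,
and a separate converter that accepts the irregular form (a surrogate pair
written as two three-byte sequences) and repairs it. Both function names
are hypothetical, and this illustrates the idea rather than anything the
UTC or the IETF has specified.

```python
def decode_strict(data: bytes) -> str:
    """Strict view: any surrogate encoded in UTF-8 is ill-formed."""
    # Python's UTF-8 codec raises UnicodeDecodeError on encoded surrogates.
    return data.decode("utf-8")


def convert_legacy(data: bytes) -> str:
    """Migration-converter view: accept surrogate pairs that were each
    encoded as a three-byte sequence and join them into one scalar value."""
    # Step 1: let encoded surrogates through as surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: re-pair through UTF-16. The strict UTF-16 decode raises
    # UnicodeDecodeError if a surrogate is still isolated.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    irregular = b"\xed\xa0\x80\xed\xb0\x80"  # D800 DC00 as two 3-byte forms
    try:
        decode_strict(irregular)
    except UnicodeDecodeError as exc:
        print("strict decoder rejected input:", exc.reason)
    print(hex(ord(convert_legacy(irregular))))
```

Keeping the lenient path in its own function means ordinary consumers
only ever see the strict definition.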

--Ken Whistler