[I18n-sig] Re: Unicode surrogates: just say no!

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 08:52:44 +0100

[I'm CC-ing the unicode list again because I'm doing some fairly
sophisticated interpretation of the Unicode conformance requirements
below and I'd like to have someone with more experience with this
check my reasoning.]

On Wed, 27 Jun 2001, martin@loewis.home.cs.tu-berlin.de wrote:
>> This is wrong.  It is a bug to encode a non-BMP character with six
>> bytes by pretending that the (surrogate) values used in the UTF-16
>> representation are BMP characters and encoding the character as
>> though it was a string consisting of that character.  It is also a
>> bug to interpret such a six-byte sequence as a single character.
>> This was clarified in Unicode 3.1.
> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1.

See the section called "UTF-8 Corrigendum" in TR 27.  It explains it
all in detail.

> Where exactly does it say that processing a six-byte two-surrogates
> sequence as a single character is non-conforming?

See D39(c) at <http://www.unicode.org/unicode/reports/tr27>.  This
defines such a six-byte sequence as an "irregular UTF-8 code unit
sequence" and goes on to state that, as a consequence of C12,
conforming processes are not allowed to generate such sequences.
This really ought to be obvious anyway: UTF-8 is defined to represent
a given USV with 1 to 4 bytes, so clearly 6 is not possible.
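To make the byte counts concrete, here is a small Python 3 sketch (a modern illustration, not something available when this was written). It shows that a supplementary character encodes to exactly 4 bytes in UTF-8, and that the irregular six-byte sequence built from its two surrogates is rejected by a strict decoder:

```python
# U+10000 (the first supplementary character) is 4 bytes in UTF-8.
s = "\U00010000"
encoded = s.encode("utf-8")
print(encoded)       # b'\xf0\x90\x80\x80'
print(len(encoded))  # 4

# The irregular six-byte form encodes each UTF-16 surrogate code unit
# (0xD800, 0xDC00) as if it were a BMP character.  A strict UTF-8
# decoder must reject it.
irregular = b"\xed\xa0\x80\xed\xb0\x80"
try:
    irregular.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```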

Conversely, C12(a) states that a conformant process cannot produce
"ill-formed code unit sequences" while producing data in a UTF.  The
definition of this term is given in D30 as a code unit sequence that
cannot be produced from a sequence of Unicode scalar values.  This is
where things get somewhat more interesting.  Somewhat surprisingly,
the definition of "Unicode Scalar Value" has not been changed from 3.0
to 3.1.  The reason why one might expect this to have changed is that
in 3.0 UTF-16 was "the" Unicode format, so that USVs were defined in
terms of UTF-16 code units.  In 3.1 it is stated elsewhere that the
different UTFs are simply concrete ways to store sequences of USVs.
However, the definition of USV is still

  either: a value in the range 0 - 0xFFFF which is not a high or
  low surrogate in UTF-16,

  or: a value in the range 0x10000 - 0x10FFFF which is obtained by
  taking a pair of values that form a high and low surrogate
  respectively in UTF-16 and applying the usual formula.

Since there is no way to form a value in the range 0xD800 - 0xDFFF in
this fashion, it follows that a USV cannot be in this range.
Therefore you are not allowed to create a three-byte sequence that is
the UTF-8 encoding of a value in this range, and consequently you are
not allowed to generate pairs of such sequences either.
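The "usual formula" is the standard UTF-16 surrogate-pair combination. A quick Python sketch (the function name is mine, not from the standard) makes it easy to check that nothing in 0xD800 - 0xDFFF is reachable:

```python
def usv_from_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into a Unicode
    scalar value using the standard formula."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The smallest possible pair yields 0x10000 and the largest yields
# 0x10FFFF, so the result is always >= 0x10000 -- well clear of the
# surrogate range 0xD800 - 0xDFFF.
print(hex(usv_from_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
print(hex(usv_from_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```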

I hope this is all clear.

One very important thing to keep in mind when doing this stuff is that
3.1 is a brand new standard, less than one and a half months old.  A
consequence of this is that most of the material on the Unicode web
site still refers to version 3.0, so you have to be very careful to
check that the information you're looking at is in fact up to date.
(The only updated information I could find was TR 27 and [probably]
the data tables.)

> What exactly does it say that the conforming behaviour should be?

Argh.  Treat it as an error, probably.  You go and read the standard
yourself, my head is already hurting.  8-)

>> Personally, I think that the codecs should report an error in the
>> appropriate fashion when presented with a python unicode string
>> which contains values that are not allowed, such as lone
>> surrogates.
> Other people have read Unicode 3.1 and came to the conclusion that
> it mandates that implementations accept such a character...

Well, they're wrong.  The standard is clear as ink in this regard.
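For what it's worth, reporting an error is exactly what a strict codec can do here; a Python 3 sketch (modern Python, long after this thread, where the built-in UTF-8 codec takes this position by default):

```python
# A lone high surrogate can exist in a Python str, but the strict
# UTF-8 codec refuses to encode it, reporting an error as suggested.
lone = "\ud800"
try:
    lone.encode("utf-8")
except UnicodeEncodeError as e:
    print("codec reported an error:", e.reason)
```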

Big Gaute                               http://www.srcf.ucam.org/~gs234/
I can't think about that.  It doesn't go with HEDGES in the shape of