[I18n-sig] Re: Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 14:04:18 +0200


> >> It is also a
> >> bug to interpret such a six-byte sequence as a single character.
> >> This was clarified in Unicode 3.1.
> > 
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1.
> 
> See the section called "UTF-8 Corrigendum" in TR 27.  It explains it
> all in detail.

I've read this section back and forth, over and over again, admittedly
without having a copy of Unicode 3.0 at hand to mentally apply the
changes.

> > Where exactly does it say that processing a six-byte two-surrogates
> > sequence as a single character is non-conforming?
> 
> See D39(c) at <http://www.unicode.org/unicode/reports/tr27>.  This
> defines such a six-byte sequence as an "irregular UTF-8 code unit
> sequence" and goes on to state that, as a consequence of C12,
> conforming processes are not allowed to generate such sequences.

[I guess this is D36(c)]
Yes, but you've claimed that one *also* must not interpret such a
sequence as a single character - this only says that you must never
generate such a sequence.

> Therefore you are not allowed to create a 3 byte sequence that is the
> UTF-8 encoding of value in this range.  Therefore you are not allowed
> to generate pairs of such sequences either.
> 
> I hope this is all clear.

That is all clear, but I still wonder why you said that the six byte
sequence (which no conforming process can have produced) must not be
interpreted as a single character. Specifically, C12 is amended with

# Processes may transform irregular code unit sequences into the
# equivalent well-formed code unit sequences.
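
To make that amendment concrete, here is a minimal sketch of such a
transformation. It assumes Python 3 codec behaviour (the "surrogatepass"
error handler), and the function name regularize_utf8 is hypothetical. It
shows one possible reading of "transform into the equivalent well-formed
sequence", not something TR 27 itself prescribes.

```python
def regularize_utf8(data: bytes) -> bytes:
    """Rewrite six-byte surrogate-pair sequences as four-byte UTF-8.

    Assumption: relies on Python 3's 'surrogatepass' error handler.
    A lone (unpaired) surrogate raises UnicodeDecodeError.
    """
    # Accept three-byte encodings of surrogate code points while decoding.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each high/low surrogate pair
    # into the single supplementary code point it represents.
    merged = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # Strict encoding now yields only well-formed UTF-8.
    return merged.encode("utf-8")


# U+10000 written as two three-byte surrogate encodings (irregular form).
irregular = b"\xed\xa0\x80\xed\xb0\x80"
print(regularize_utf8(irregular))
```

For this input the irregular six bytes become the four-byte form
F0 90 80 80, while input that is already well-formed passes through
unchanged.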

> > Other people have read Unicode 3.1 and came to the conclusion that
> > it mandates that implementations accept such a character...
> 
> Well, they're wrong.  The standard is clear as ink in this regard.

Not that clear to me... Please have a look at bug # 2 in

http://sourceforge.net/tracker/download.php?group_id=5470&atid=105470&file_id=7439&aid=433882

The submitter claims that an implementation has to accept a single
UTF-8 encoded surrogate word. Of course, it might be that accepting a
single one in UTF-8 is mandated, but if you have two of them, you must
reject them...
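
To pin down the two cases being distinguished here, a small sketch
(again assuming Python 3's "surrogatepass" handler; surrogate_status is a
hypothetical name) that only classifies the input. Whether each case is
then accepted or rejected is exactly the policy question under
discussion, and the code does not answer it.

```python
def surrogate_status(data: bytes) -> str:
    """Classify UTF-8-like input by the surrogates it encodes."""
    # Assumption: 'surrogatepass' lets three-byte surrogate encodings through.
    text = data.decode("utf-8", "surrogatepass")
    i, found_pair = 0, False
    while i < len(text):
        cp = ord(text[i])
        is_high = 0xD800 <= cp <= 0xDBFF
        next_is_low = (i + 1 < len(text)
                       and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF)
        if is_high and next_is_low:
            # High surrogate followed by low: the six-byte case.
            found_pair = True
            i += 2
        elif 0xD800 <= cp <= 0xDFFF:
            # Any surrogate that is not part of such a pair.
            return "lone surrogate"
        else:
            i += 1
    return "surrogate pair" if found_pair else "well-formed"


print(surrogate_status(b"\xed\xa0\x80"))              # single surrogate
print(surrogate_status(b"\xed\xa0\x80\xed\xb0\x80"))  # two of them
```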

Regards,
Martin