[I18n-sig] Re: Unicode surrogates: just say no!
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 14:04:18 +0200
> >> It is also a
> >> bug to interpret such a six-byte sequence as a single character.
> >> This was clarified in Unicode 3.1.
> >
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1.
>
> See the section called "UTF-8 Corrigendum" in TR 27. It explains it
> all in detail.
I've read this section back and forth, over and over again, admittedly
without having a copy of Unicode 3.0 at hand to mentally apply the
changes.
> > Where exactly does it say that processing a six-byte two-surrogates
> > sequence as a single character is non-conforming?
>
> See D39(c) at <http://www.unicode.org/unicode/reports/tr27>. This
> defines such a six-byte sequence as an "irregular UTF-8 code unit
> sequence" and goes on to state that, as a consequence of C12,
> conforming processes are not allowed to generate such sequences.
[I guess this is D36(c)]
Yes, but you've claimed that one *also* must not interpret such a
sequence as a single character - this only says that you must never
generate such a sequence.
> Therefore you are not allowed to create a 3 byte sequence that is the
> UTF-8 encoding of value in this range. Therefore you are not allowed
> to generate pairs of such sequences either.
>
> I hope this is all clear.
That is all clear, but I still wonder why you said that the six-byte
sequence (which no conforming process can have produced) must not be
interpreted as a single character. Specifically, C12 is amended with:
# Processes may transform irregular code unit sequences into the
# equivalent well-formed code unit sequences.
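To make the amendment concrete, here is a minimal sketch of such a
transformation, written against present-day Python 3 (not anything that
existed when this was posted). The function name regularize is
hypothetical, and the example assumes the "surrogatepass" error handler
is an acceptable way to let the irregular input through; it is one
possible reading of the sentence above, not a statement of what the
standard requires.

```python
def regularize(data: bytes) -> bytes:
    """Rewrite six-byte surrogate-pair sequences as four-byte UTF-8.

    Hypothetical helper: illustrates one way to turn an irregular
    sequence into the equivalent well-formed one.
    """
    # Let UTF-8 encoded surrogates through as lone surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Write the code points out as 16-bit units, then decode strictly:
    # adjacent high/low surrogates are joined into one character, and an
    # unpaired surrogate raises UnicodeDecodeError.
    joined = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # Strict UTF-8 encoding now yields the well-formed form.
    return joined.encode("utf-8")


# U+10000 written as two three-byte sequences (D800, then DC00).
irregular = b"\xed\xa0\x80\xed\xb0\x80"
well_formed = regularize(irregular)
```

With this input, well_formed is the four-byte sequence F0 90 80 80.
Whether a decoder may treat the six-byte input directly as one
character is exactly the open question here; the sketch only shows
that the transformation itself is mechanical.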
> > Other people have read Unicode 3.1 and come to the conclusion that
> > it mandates that implementations accept such a character...
>
> Well, they're wrong. The standard is clear as ink in this regard.
Not that clear to me... Please have a look at bug # 2 in
http://sourceforge.net/tracker/download.php?group_id=5470&atid=105470&file_id=7439&aid=433882
The submitter claims that an implementation has to accept a single
UTF-8 encoded surrogate word. Of course, it might be that accepting a
single one in UTF-8 is mandated, but if you have two of them, you must
reject them...
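
For anyone who wants to experiment with the two cases (one encoded
surrogate versus two), the following sketch shows how present-day
Python 3 treats them. The helper names strict_accepts and
count_surrogates are hypothetical. The behaviour shown is that of
Python's own codec, which rejects both cases in strict mode; it is not
offered as evidence of what Unicode 3.1 mandates.

```python
def strict_accepts(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def count_surrogates(data: bytes) -> int:
    """Count the surrogate code points found by a lenient decode."""
    # "surrogatepass" lets three-byte encoded surrogates through.
    text = data.decode("utf-8", "surrogatepass")
    return sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF)


single = b"\xed\xa0\x80"              # U+D800 alone
pair = b"\xed\xa0\x80\xed\xb0\x80"    # U+D800 followed by U+DC00
regular = b"\xf0\x90\x80\x80"         # U+10000 in four bytes

results = {
    name: (strict_accepts(data), count_surrogates(data))
    for name, data in [("single", single), ("pair", pair), ("regular", regular)]
}
```

A decoder that wanted to distinguish the two cases could use something
like count_surrogates to decide which policy to apply.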
Regards,
Martin