[I18n-sig] Re: Unicode surrogates: just say no!
Rick McGowan
rick@unicode.org
Wed, 27 Jun 2001 08:52:28 -0700
"Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:
> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1. Where exactly does it say that processing
> a six-byte two-surrogates sequence as a single character is
> non-conforming?
It's not non-conforming, it's "irregular". Please read the technical
report (#27) that I pointed at yesterday (on the i18n-sig@python). It
gives detailed specifications for UTF-8. Anything not in the table "UTF-8
Bit Distribution" and accompanying description shown there is
non-conforming.
Rule D36 specifies:
<quote>
(a) UTF-8 is the Unicode Transformation Format that serializes a Unicode
code point as a sequence of one to four bytes, as specified in Table 3.1,
UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not
match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the
first three bytes correspond to a high surrogate, and the next three bytes
correspond to a low surrogate. As a consequence of C12, these irregular
UTF-8 sequences shall not be generated by a conformant process.
</quote>
In other words, it is non-conforming to generate two 3-byte things for a
surrogate pair. However, it remains "legal but irregular" to interpret
such a pair of 3-byte entities. Why wasn't it just made non-conforming to
interpret such things? Because there are old implementations of UTF-8 in
the world that pre-date the definition of surrogates, and if they ever
encountered codepoints in that range, they would generate those pairs of
3-byte sequences. So it is legal for a process to recognize them and
either raise an exception or try to "fix" the situation.
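To make the byte patterns concrete, here is a minimal Python sketch of what "recognizing" such a pair could look like. The function names are mine and purely illustrative, not from TR27 or any existing codec; the byte ranges follow from applying the 3-byte bit distribution to the surrogate ranges D800..DBFF and DC00..DFFF.

```python
def is_irregular_pair(data: bytes) -> bool:
    """True if data is a six-byte sequence: 3-byte high surrogate + 3-byte low surrogate."""
    if len(data) != 6:
        return False
    hi, lo = data[:3], data[3:]
    # D800..DBFF encodes as ED A0..AF 80..BF; DC00..DFFF as ED B0..BF 80..BF.
    return (hi[0] == 0xED and 0xA0 <= hi[1] <= 0xAF and 0x80 <= hi[2] <= 0xBF
            and lo[0] == 0xED and 0xB0 <= lo[1] <= 0xBF and 0x80 <= lo[2] <= 0xBF)


def _three_byte_value(b: bytes) -> int:
    # Undo the 3-byte bit distribution: 1110xxxx 10xxxxxx 10xxxxxx.
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def combine_irregular_pair(data: bytes) -> int:
    """Return the supplementary code point that the irregular pair stands for."""
    if not is_irregular_pair(data):
        raise ValueError("not an irregular six-byte surrogate sequence")
    hi = _three_byte_value(data[:3])
    lo = _three_byte_value(data[3:])
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


irregular = b"\xed\xa0\x80\xed\xb0\x80"   # D800 DC00, each as 3 bytes
print(hex(combine_irregular_pair(irregular)))  # 0x10000
```

The regular (four-byte) form of the same code point is F0 90 80 80, which is what a conformant process has to generate instead.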
> What exactly does it say that the conforming behaviour
> should be?
TR27 says: "Processes that require unique representation must not
interpret irregular UTF code unit sequences as characters. They may, for
example, reject or remove those sequences."
If I were going to implement a UTF-8 interpreter for Python, I would give
it a hook to optionally return a specific error condition on irregular
sequences.
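A rough sketch of what I mean by a hook, assuming a wrapper around Python's built-in strict UTF-8 decoder. The names decode_utf8, on_irregular and IrregularSequenceError are hypothetical, made up for this illustration; this is one possible shape, not a proposal for the actual codec API.

```python
import re

# Six bytes: 3-byte-encoded high surrogate followed by 3-byte-encoded low surrogate.
_IRREGULAR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")


class IrregularSequenceError(ValueError):
    """Raised when an irregular sequence is found and no hook is supplied."""


def combine(seq: bytes) -> str:
    """Example hook: "fix" the pair by returning the character it stands for."""
    hi = ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    lo = ((seq[3] & 0x0F) << 12) | ((seq[4] & 0x3F) << 6) | (seq[5] & 0x3F)
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))


def decode_utf8(data: bytes, on_irregular=None) -> str:
    """Decode UTF-8; irregular pairs go to on_irregular(bytes) -> str.

    With no hook, an irregular pair raises IrregularSequenceError.
    Everything else is left to the strict built-in decoder.
    """
    out = []
    pos = 0
    for m in _IRREGULAR.finditer(data):
        out.append(data[pos:m.start()].decode("utf-8"))
        if on_irregular is None:
            raise IrregularSequenceError(
                "irregular UTF-8 sequence at byte offset %d" % m.start())
        out.append(on_irregular(m.group()))
        pos = m.end()
    out.append(data[pos:].decode("utf-8"))
    return "".join(out)


data = b"a\xed\xa0\x80\xed\xb0\x80b"
print(decode_utf8(data, combine) == "a\U00010000b")   # True
print(decode_utf8(data, lambda seq: ""))              # ab
```

A process that requires unique representation would leave the hook unset (reject) or pass one that returns an empty string (remove); only a process that wants to tolerate old data would pass something like combine.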
If you still find the definitions and discussion in the technical report
to be unclear, then the Unicode editorial committee would undoubtedly like
to hear about it.
Rick