[I18n-sig] Re: Unicode surrogates: just say no!

Rick McGowan rick@unicode.org
Wed, 27 Jun 2001 08:52:28 -0700


"Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:

> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1. Where exactly does it say that processing
> a six-byte two-surrogates sequence as a single character is
> non-conforming?

It's not non-conforming, it's "irregular". Please read the technical
report (#27) that I pointed at yesterday (on the i18n-sig@python list).  It
gives detailed specifications for UTF-8.  Anything not in the table "UTF-8
Bit Distribution" and accompanying description shown there is
non-conforming.

Rule D36 specifies:

<quote>
(a) UTF-8 is the Unicode Transformation Format that serializes a Unicode  
code point as a sequence of one to four bytes, as specified in Table 3.1,  
UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not  
match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the  
first three bytes correspond to a high surrogate, and the next three bytes  
correspond to a low surrogate. As a consequence of C12, these irregular  
UTF-8 sequences shall not be generated by a conformant process.
</quote>
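
To make clause (c) concrete, here is a minimal sketch in Python of a check
for the six-byte irregular form. The function name is_irregular_pair is
made up for illustration and is not part of any library. The byte ranges
follow from the three-byte bit distribution: high surrogates U+D800..U+DBFF
serialize as ED A0..AF 80..BF, and low surrogates U+DC00..U+DFFF as
ED B0..BF 80..BF.

```python
def is_irregular_pair(data: bytes, i: int = 0) -> bool:
    """Return True if data[i:i+6] is a 3-byte high surrogate followed by
    a 3-byte low surrogate (hypothetical helper, for illustration only)."""
    chunk = data[i:i + 6]
    if len(chunk) < 6:
        return False
    # First three bytes: ED A0..AF 80..BF  (U+D800..U+DBFF)
    high = chunk[0] == 0xED and 0xA0 <= chunk[1] <= 0xAF and 0x80 <= chunk[2] <= 0xBF
    # Next three bytes:  ED B0..BF 80..BF  (U+DC00..U+DFFF)
    low = chunk[3] == 0xED and 0xB0 <= chunk[4] <= 0xBF and 0x80 <= chunk[5] <= 0xBF
    return high and low
```

For U+10000 the irregular form is ED A0 80 ED B0 80 (surrogates D800 DC00),
while the regular four-byte form is F0 90 80 80; the check accepts the first
and not the second.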

In other words, it is non-conforming to generate two 3-byte things for a  
surrogate pair.  However, it remains "legal but irregular" to interpret  
such a pair of 3-byte entities.  Why wasn't it just made non-conforming to  
interpret such things?  Because there are old implementations of UTF-8 in  
the world that pre-date the definition of surrogates, and if they ever  
encountered codepoints in that range, they would generate those pairs of  
3-byte sequences.  So it is legal for a process to recognize them and  
either raise an exception or try to "fix" the situation.
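
As one possible reading of "fix", the sketch below rewrites irregular
six-byte pairs into the regular four-byte form. It assumes a Python 3
interpreter, whose "surrogatepass" error handler lets the UTF-8 and UTF-16
codecs carry surrogate code points; the function name repair_irregular is
hypothetical. This is an illustration under those assumptions, not a
recommended policy.

```python
def repair_irregular(data: bytes) -> bytes:
    """Re-encode 3-byte surrogate pairs as regular 4-byte UTF-8
    (hypothetical helper, for illustration only)."""
    # Decode leniently: 3-byte surrogate encodings become surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so each high/low pair merges into one
    # supplementary code point. The strict UTF-16 decode raises
    # UnicodeDecodeError if a surrogate is left unpaired.
    merged = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # Strict encode: the output contains no surrogate byte sequences.
    return merged.encode("utf-8")
```

An unpaired surrogate is not repaired; it surfaces as an exception, which
corresponds to the other option mentioned above.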

> What exactly does it say that the conforming behaviour
> should be?

TR27 says: "Processes that require unique representation must not
interpret irregular UTF-8 code unit sequences as characters. They may, for
example, reject or remove those sequences."

If I were going to implement a UTF-8 interpreter for Python, I would give
it a hook to optionally return a specific error condition on irregular
sequences.
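
A rough sketch of what such a hook might look like, written for Python 3.
Everything named here (IrregularUTF8Error, decode_unique, the on_irregular
parameter) is invented for illustration and is not an existing Python API.
The two modes mirror the "reject or remove" options in the TR27 text.

```python
import re

# 3-byte high surrogate (ED A0..AF xx) followed by 3-byte low surrogate
# (ED B0..BF xx). ED is never followed by A0..BF in legal UTF-8, so a
# match cannot occur inside a legal sequence.
_IRREGULAR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")


class IrregularUTF8Error(ValueError):
    """Hypothetical error condition for irregular UTF-8 sequences."""

    def __init__(self, position: int):
        super().__init__(f"irregular UTF-8 surrogate pair at byte {position}")
        self.position = position


def decode_unique(data: bytes, on_irregular: str = "reject") -> str:
    """Decode UTF-8 without interpreting irregular sequences as characters.

    on_irregular="reject" raises IrregularUTF8Error at the first one;
    on_irregular="remove" drops them before decoding.
    """
    if on_irregular == "reject":
        match = _IRREGULAR.search(data)
        if match is not None:
            raise IrregularUTF8Error(match.start())
    elif on_irregular == "remove":
        data = _IRREGULAR.sub(b"", data)
    else:
        raise ValueError(f"unknown on_irregular mode: {on_irregular!r}")
    # Strict decode: anything else that is illegal still raises UnicodeDecodeError.
    return data.decode("utf-8")
```

A caller that needs to distinguish irregular input from other malformed
input can catch IrregularUTF8Error separately and inspect its position
attribute.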

If you still find the definitions and discussion in the technical report  
to be unclear, then the Unicode editorial committee would undoubtedly like  
to hear about it.

	Rick