[I18n-sig] Re: Unicode surrogates: just say no!

Rick McGowan rick@unicode.org
Wed, 27 Jun 2001 13:04:00 -0700

"Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

Part of the confusion everyone has is because the UTFs have been envisioned
as both (A) pure mathematical transformations of integer spaces, and (B)  
transformations of coded characters.  But the explanations have been  
muddled a little.  Part of the re-write that's happening now in the Unicode  
editorial committee is dealing with this confusion.  In the future, I hope  
that it can be clarified.

> an irregular sequence (of bytes, or code units) is not illegal.
> Also, I assume that negation of these concepts follows the English
> language rules (i.e. "not illegal" == "legal", "not ill-formed" ==
> "well-formed", etc)

Well, yes, you're right.  However, in English when something is phrased as
"not foo" that wording often carries the implication of some shadiness that  
occupies the boundary between foo and anti-foo.  In this sense, "not  
illegal" does not mean the same thing as "legal".  "Not illegal" means  
something more like "socially backward and frowned upon, but not worthy of  
legal prosecution in the strict sense".

Here's my take on irregular sequences / lone surrogates:

If you have a process which is claiming to take in arbitrary data and emit  
identical data in the same or different UTF, then it should probably allow  
unpaired surrogates to be eaten, stored, and re-emitted without error in  
the UTF-8 input case.

If you have a process which is claiming to take in legal characters and
transform them into something else, then you can (A) barf on lone
surrogates or (B) try to fix the situation.
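To make options (A) and (B) concrete, here is a rough Python 3 sketch.  The
helper name clean_surrogates and its repair flag are invented for this
illustration, and substituting U+FFFD is only one possible reading of "fix
the situation"; it is not something the text above prescribes.

```python
def clean_surrogates(text, repair=False):
    """Hypothetical helper: reject (A) or repair (B) lone surrogates.

    Assumption: in a Python 3 str a supplementary character is a single
    code point, so any code point in U+D800..U+DFFF is treated as unpaired.
    """
    out = []
    for index, ch in enumerate(text):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            if not repair:
                # Option (A): refuse the input outright.
                raise ValueError(
                    "lone surrogate U+%04X at index %d" % (ord(ch), index)
                )
            # Option (B): one possible repair, substitute U+FFFD.
            out.append("\ufffd")
        else:
            out.append(ch)
    return "".join(out)
```

Called as clean_surrogates(s) it raises ValueError on the first lone
surrogate; called with repair=True it returns a cleaned copy instead.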

Allowing the user of the API to decide which is preferable in a given
situation is probably the right answer.  I.e., the codec for UTF-8  
reading/writing should have strict and non-strict modes.  And strict mode  
should be the default.
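As a sketch of what such a switch might look like in today's Python 3, the
wrappers below map a strict flag onto the standard "strict" and
"surrogatepass" error handlers.  The function names decode_utf8 and
encode_utf8 are made up for this example; only the error handler names come
from the standard library.

```python
def decode_utf8(data, strict=True):
    """Decode UTF-8 bytes; non-strict mode lets lone surrogates through."""
    errors = "strict" if strict else "surrogatepass"
    return data.decode("utf-8", errors)


def encode_utf8(text, strict=True):
    """Encode to UTF-8 bytes; non-strict mode emits lone surrogates."""
    errors = "strict" if strict else "surrogatepass"
    return text.encode("utf-8", errors)


# Three-byte UTF-8 form of the lone surrogate U+D800.
lone = b"\xed\xa0\x80"

# Non-strict mode eats the sequence and re-emits it unchanged.
roundtrip = encode_utf8(decode_utf8(lone, strict=False), strict=False)

# Strict mode, the default, refuses the same input.
try:
    decode_utf8(lone)
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

With strict as the default argument, a caller has to opt in explicitly
before irregular sequences are accepted or produced.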

> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Exactly.  They could exist in nature; their existence cannot be ruled out,
and hence, it may transpire that you could be presented with one.
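A reader that follows "may read, must not write" could look roughly like
the Python 3 sketch below.  The name read_lenient_utf8 is hypothetical, and
joining the pair by a round trip through UTF-16 is just one convenient
technique, assumed here for brevity.

```python
def read_lenient_utf8(data):
    """Hypothetical reader that accepts UTF-8 encoded surrogate pairs."""
    # Let three-byte surrogate sequences through as surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so a high/low pair becomes one code point.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le", "surrogatepass")


# U+D800 followed by U+DC00, each encoded as a three-byte UTF-8 sequence.
pair_as_utf8 = b"\xed\xa0\x80\xed\xb0\x80"

text = read_lenient_utf8(pair_as_utf8)

# Writing uses the default strict encoder, which produces the regular
# four-byte form of the supplementary character.
written = text.encode("utf-8")
```

The input is accepted on the way in, but what goes back out is the
well-formed four-byte sequence rather than the six-byte pair.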