[I18n-sig] Re: Unicode surrogates: just say no!
Rick McGowan
rick@unicode.org
Wed, 27 Jun 2001 13:04:00 -0700
"Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:
> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).
Part of the confusion everyone has is because the UTFs have been envisioned
as both (A) pure mathematical transformations of integer spaces, and (B)
transformations of coded characters. But the explanations have been
muddled a little. Part of the re-write that's happening now in the Unicode
editorial committee is dealing with this confusion. In the future, I hope
that it can be clarified.
> an irregular sequence (of bytes, or code units) is not illegal.
> Also, I assume that negation of these concepts follows the English
> language rules (i.e. "not illegal" == "legal", "not ill-formed" ==
> "well-formed", etc)
Well, yes, you're right. However, in English when something is phrased as
"not foo" that wording often carries the implication of some shadiness that
occupies the boundary between foo and anti-foo. In this sense, "not
illegal" does not mean the same thing as "legal". "Not illegal" means
something more like "socially backward and frowned upon, but not worthy of
legal prosecution in the strict sense".
Here's my take on irregular sequences / lone surrogates:
If you have a process which is claiming to take in arbitrary data and emit
identical data in the same or different UTF, then it should probably allow
unpaired surrogates to be eaten, stored, and re-emitted without error in
the UTF-8 input case.
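
A minimal sketch of such a pass-through process, assuming present-day
Python 3 and its "surrogatepass" error handler (neither existed when this
was written). The function names are hypothetical, and the handler only
lets surrogates through; other malformed input still raises an error.

```python
def transcode_utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 as UTF-16-LE, carrying unpaired surrogates through."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


def transcode_utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16-LE as UTF-8, carrying unpaired surrogates through."""
    text = data.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


# "a", a lone high surrogate U+D800 in its three-byte form, then "b".
original = b"a\xed\xa0\x80b"
round_tripped = transcode_utf16le_to_utf8(transcode_utf8_to_utf16le(original))
assert round_tripped == original
```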
If you have a process which is claiming to take in legal characters and
transform them into something else, then you can (A) barf on lone
surrogates or (B) try to fix the situation.
Allowing the user of the API to decide which is preferable in a given
situation is probably the right answer. I.e., the codec for UTF-8
reading/writing should have strict and non-strict modes. And strict mode
should be the default.
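
One way the strict/non-strict split could look, sketched with present-day
Python 3 error handlers rather than any codec from this discussion. The
name read_utf8 and its strict flag are hypothetical. Here the non-strict
mode takes option (B) by substituting U+FFFD; using "surrogatepass"
instead would keep the lone surrogate.

```python
def read_utf8(data: bytes, strict: bool = True) -> str:
    """Decode UTF-8; strict mode (the default) rejects encoded surrogates."""
    if strict:
        # Option (A): raise UnicodeDecodeError on a surrogate byte sequence.
        return data.decode("utf-8", errors="strict")
    # Option (B): repair by substituting U+FFFD REPLACEMENT CHARACTER.
    return data.decode("utf-8", errors="replace")


lone_surrogate = b"x\xed\xa0\x80y"  # "x", U+D800 in three-byte form, "y"

try:
    read_utf8(lone_surrogate)
    rejected = False
except UnicodeDecodeError:
    rejected = True

repaired = read_utf8(lone_surrogate, strict=False)
```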
> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.
Exactly. They could exist in nature; their existence cannot be ruled out,
and hence, it may transpire that you could be presented with one.
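
As a hedged illustration of "read them, but do not write them": the
sketch below accepts a surrogate pair that was encoded as two three-byte
UTF-8 sequences and joins it into one supplementary character, so that a
later write produces the four-byte form. It assumes present-day Python 3;
the function name is hypothetical, and input that still contains an
unpaired surrogate raises UnicodeDecodeError.

```python
def read_lenient_utf8(data: bytes) -> str:
    """Decode UTF-8, joining separately encoded surrogate pairs."""
    # Accept the three-byte surrogate forms on input.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 combines each high/low pair into a
    # single code point; the strict decode rejects unpaired surrogates.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


# U+10000 written as the pair U+D800 U+DC00, three bytes each.
six_byte_form = b"\xed\xa0\x80\xed\xb0\x80"
text = read_lenient_utf8(six_byte_form)
rewritten = text.encode("utf-8")  # strict encoder emits the four-byte form
```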
Rick