[I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

Wed, 27 Jun 2001 07:13:39 -0700

You are correct in that the text is not nearly as clear as it should be,
and is open to different interpretations. My view of the status in Unicode
3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm.
Corresponding computations are on
http://www.macchiato.com/utc/utf_computations.htm.

One of the goals for Unicode 4.0 is to clear up the text describing UTFs in
particular, which may change some of the edge cases (isolates and/or
irregulars).

Mark

----- Original Message -----
From: "Gaute B Strokkenes" <gs234@cam.ac.uk>
To: "Machin, John" <JMachin@Colonial.com.au>
Cc: <tree@basistech.com>; <guido@digicool.com>; <i18n-sig@python.org>;
<unicode@unicode.org>; "Martin v. Loewis"
<martin@loewis.home.cs.tu-berlin.de>
Sent: Wednesday, June 27, 2001 05:38
Subject: Re: validity of lone surrogates (was Re: Unicode surrogates: just
say no!)

> On Wed, 27 Jun 2001, JMachin@Colonial.com.au wrote:
> >
> > [earlier correspondents]
> >>> Personally, I think that the codecs should report an error in the
> >>> appropriate fashion when presented with a python unicode string
> >>> which contains values that are not allowed, such as lone
> >>> surrogates.
> >>
> >> Other people have read Unicode 3.1 and come to the conclusion that
> >> it mandates that implementations accept such a character...
> >
> > [big Gaute]
> > Well, they're wrong.  The standard is clear as ink in this regard.
> >
> > [my comment]
> > Unfortunately ink is usually opaque :-)
>
> Precisely.  That's standardese for you.  8-)
>
> > The problem is caused by section 3.8 in Unicode 3.0, which is not
> > specifically amended by 3.1 as far as I can tell.
>
> It's not; AFAIK the list of changes at
> <http://www.unicode.org/unicode/reports/tr27/> is supposed to be
> canonical and it's not listed.
>
> > The offending text occurs after clause D29. It says "... every UTF
> > supports lossless round-trip transcoding ..." and "... a UTF mapping
> > must also map invalid Unicode scalar values to unique code value
> > sequences. These invalid scalar values include [0xFFFE], [0xFFFF]
> > and unpaired surrogates."
>
> Sigh.  This means that the Unicode standard is self-contradicting.
>
> It is nowhere defined precisely what "invalid Unicode Scalar Value"
> means.  I can only assume that it means "an integer in the range 0 -
> 0x10FFFF that is not a Unicode Scalar Value".  Even so, the statement
> is just plain wrong as far as UTF-16 is concerned.  If UTF-16 is
> supposed to define a bijective mapping from sequences of integers in
> the range 0 - 0x10FFFF to some set of sequences of integers in the range
> 0 - 0xFFFF (and this is definitely what this statement is saying), this
> becomes a contradiction: suppose that H is some high surrogate value
> and that L is some low surrogate value, and that U is the
> corresponding USV.  Then the sequences
>
>   H, L    <-- sequence consisting of two "invalid USVs"
>
> and
>
>   U       <-- sequence consisting of a single (valid) USV
>
> both map to
>
>   H, L    <-- sequence of two UTF-16 code values
>
> under UTF-16, so that the mapping induced by UTF-16 is very definitely
> not bijective.
>
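The collision described above can be checked mechanically. The following is only a sketch in Python: `utf16_units` is a hypothetical helper, not part of any codec, and it assumes a deliberately naive mapping that passes every integer below 0x10000 through unchanged, lone surrogates included, which is the reading of the "lossless round-trip" text under discussion.

```python
def utf16_units(values):
    """Naively map integers in 0..0x10FFFF to 16-bit UTF-16 code values.

    Hypothetical helper for illustration: it does NOT reject surrogates,
    because the point is to see what happens if they are accepted.
    """
    units = []
    for v in values:
        if not 0 <= v <= 0x10FFFF:
            raise ValueError("out of range: %#x" % v)
        if v < 0x10000:
            # Everything below 0x10000 is emitted as-is, surrogates included.
            units.append(v)
        else:
            # Supplementary values are split into a high/low surrogate pair.
            v -= 0x10000
            units.append(0xD800 + (v >> 10))
            units.append(0xDC00 + (v & 0x3FF))
    return units


H, L = 0xD800, 0xDC00                                   # a high and a low surrogate
U = 0x10000 + ((H - 0xD800) << 10) + (L - 0xDC00)       # the corresponding USV

from_pair = utf16_units([H, L])    # two "invalid USVs"
from_single = utf16_units([U])     # one valid USV

# Two different inputs, one output: the mapping is not injective.
collides = from_pair == from_single
```

With these inputs both calls yield the same two code values, so a decoder cannot tell which input it was given.
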
> I have no idea why the standard includes this apparent error, but my
> best guess would be that this used to be true back in the pre-3.1 days
> when UTF-16 (though not with that name) was Unicode proper and UTF-16
> was not a UTF, but _the_ canonical Unicode encoding.  Note that the
> statement given in D29 actually is true when applied to UTF-8 and
> UTF-32.
>
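For comparison, a similarly naive UTF-8 mapping keeps the two inputs apart. Again this is only a sketch: `utf8_bytes` and `utf32_units` are hypothetical helpers that apply the usual bit layout to any integer in range without rejecting surrogates, which is an assumption made for the sake of the argument and not how a conformant encoder need behave.

```python
def utf8_bytes(values):
    """Apply the UTF-8 bit layout to integers in 0..0x10FFFF.

    Hypothetical helper: surrogates are NOT rejected, on purpose.
    """
    out = bytearray()
    for v in values:
        if not 0 <= v <= 0x10FFFF:
            raise ValueError("out of range: %#x" % v)
        if v < 0x80:
            out.append(v)
        elif v < 0x800:
            out += bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
        elif v < 0x10000:
            out += bytes([0xE0 | (v >> 12),
                          0x80 | ((v >> 6) & 0x3F),
                          0x80 | (v & 0x3F)])
        else:
            out += bytes([0xF0 | (v >> 18),
                          0x80 | ((v >> 12) & 0x3F),
                          0x80 | ((v >> 6) & 0x3F),
                          0x80 | (v & 0x3F)])
    return bytes(out)


def utf32_units(values):
    """UTF-32 uses one code value per integer, so the mapping is the identity."""
    return list(values)


H, L, U = 0xD800, 0xDC00, 0x10000

pair_utf8 = utf8_bytes([H, L])     # two three-byte sequences
single_utf8 = utf8_bytes([U])      # one four-byte sequence
distinct_utf8 = pair_utf8 != single_utf8
distinct_utf32 = utf32_units([H, L]) != utf32_units([U])
```

Here the surrogate pair and the supplementary value produce different byte sequences, so the ambiguity is specific to UTF-16.
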
> However, let us put this annoying fact aside for a moment.  I believe
> that D29 is intended to point out that the various UTFs will "just
> work" if you try to encode scalar values that are not proper USVs.
> This is not the same thing as saying that these invalid USVs or the
> "pseudo-characters" or whatever that arise from them have any business
> in a Unicode string.  In fact, Unicode conformant processes are
> explicitly forbidden from interpreting or using U+FFFF or U+FFFE when
> passing Unicode data between each other.  They are, however,
> explicitly allowed and even encouraged to use these values internally
> as sentinel or "fencepost" values.  To put this slightly differently,
> a process may be storing some Unicode data internally and it may be
> storing U+FFFF for some reason or another in that internal data.  It
> may be convenient for the process to use a UTF to transform this data
> into a more convenient form.  I think that D29 is merely pointing out
> that this is actually feasible, in spite of the appearance of invalid
> USVs in the internal data.
>
> I would be indebted if any of the experts who hang out on the unicode
> list could sort out this confusion.
>
> > My interpretation of this is that the 2nd part I quoted says we must
> > export the guff, and the 1st part says we must accept it back again.
> >
> > I don't particularly like this idea, and am not in favour of codecs
> > silently accepting such in incoming data --- I'm just pointing out
> > that this "lossless round-trip transcoding" concept seems to be at
> > variance with various interpretations of what is "legal".
>
> Yup.
>
> My take on this is that the various UTF codecs should follow the specs
> to the letter and reject anything else in default mode.  There should
> also be a "lenient" or "forgiving" mode in which the codec does its
> best to interpret and repair broken, nonsensical or irregular data.
> Of course, if an application uses this mode then it will have to be
> aware of the dangers involved, including the security aspects.
>
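One possible shape for the strict/lenient split, sketched with the `errors` argument that present-day Python 3 codecs accept. That is an assumption of this example and not the interface that existed when this thread was written; `encode_strict` and `encode_lenient` are hypothetical wrapper names.

```python
def encode_strict(text):
    """Default mode: refuse anything UTF-8 does not allow, e.g. lone surrogates."""
    return text.encode("utf-8")  # errors="strict" is the default


def encode_lenient(text):
    """Opt-in mode: let surrogate code points through as three-byte sequences."""
    return text.encode("utf-8", errors="surrogatepass")


lone = "\ud800"  # a string holding a single unpaired high surrogate

try:
    encode_strict(lone)
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# The caller has to ask for the lenient behaviour explicitly, in both directions.
lenient_bytes = encode_lenient(lone)
round_tripped = lenient_bytes.decode("utf-8", errors="surrogatepass")
```

Making leniency an explicit argument keeps the risk visible at the call site, which matters given the security concerns mentioned above.
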
> --
> Big Gaute                               http://www.srcf.ucam.org/~gs234/
> I'm having BEAUTIFUL THOUGHTS about the INSIPID WIVES
>  of smug and wealthy CORPORATE LAWYERS..
>
>