[I18n-sig] validity of lone surrogates (was Re: Unicode surrogates: just say no!)

Machin, John JMachin@Colonial.com.au
Wed, 27 Jun 2001 18:27:50 +1000

-----Original Message-----
From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk]
Sent: Wednesday, 27 June 2001 17:53
To: Martin v. Loewis
Cc: tree@basistech.com; guido@digicool.com; i18n-sig@python.org;
Subject: [I18n-sig] Re: Unicode surrogates: just say no!

[earlier correspondents]
>> Personally, I think that the codecs should report an error in the
>> appropriate fashion when presented with a python unicode string
>> which contains values that are not allowed, such as lone
>> surrogates.
> Other people have read Unicode 3.1 and came to the conclusion that
> it mandates that implementations accept such a character...

[big Gaute]
Well, they're wrong.  The standard is clear as ink in this regard.

[my comment]
Unfortunately ink is usually opaque :-)

The problem is caused by section 3.8 in Unicode 3.0, which is not
specifically amended by 3.1 as far as I can tell.

The offending text occurs after clause D29. It says "... every UTF supports
lossless round-trip transcoding ..." and "... a UTF mapping must also map
invalid Unicode scalar values to unique code value sequences. These invalid 
scalar values include [0xFFFE], [0xFFFF] and unpaired surrogates."
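
To make "unique code value sequences" concrete, here is a minimal sketch that
applies the UTF-8 bit layout mechanically, with no validity checks at all.
The function name utf8_bytes_unchecked is hypothetical, and UTF-8 is just one
UTF picked for illustration; none of this is taken from the standard's text
or from any codec's source.

```python
def utf8_bytes_unchecked(code_point):
    """Apply the UTF-8 bit layout to a value in 0..0x10FFFF, with no validity checks.

    Hypothetical helper for illustration only; a real codec may reject
    some of these values.
    """
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The values named in the quoted text: two unpaired surrogates, 0xFFFE, 0xFFFF.
suspects = (0xD800, 0xDFFF, 0xFFFE, 0xFFFF)
encoded = {cp: utf8_bytes_unchecked(cp) for cp in suspects}
for cp, octets in encoded.items():
    print(hex(cp), octets.hex())
```

Each of the four values comes out as its own three-byte sequence, which is
all the "unique" requirement seems to ask for. Whether a codec ought to emit
them is the separate question.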

My interpretation of this is that the 2nd part I quoted says we must export
the guff, and the 1st part says we must accept it back again.

I don't particularly like this idea, and am not in favour of codecs silently
accepting such values in incoming data. I'm just pointing out that this
"lossless round-trip transcoding" concept seems to be at variance with
interpretations of what is "legal".
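
The two policies can be put side by side in a short sketch. It assumes a
present-day Python 3 interpreter, which postdates this message, and its
"surrogatepass" error handler; the codecs under discussion here may well
behave differently, so treat it as an illustration of the two readings rather
than a description of any particular codec.

```python
lone = "\ud800"  # an unpaired high surrogate

# Strict reading: the codec reports an error instead of exporting the value.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# "Lossless round-trip" reading: export the value, then accept it back.
exported = lone.encode("utf-8", "surrogatepass")
round_tripped = exported.decode("utf-8", "surrogatepass")

# Strict decoding refuses the same octets on the way back in.
try:
    exported.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, exported, round_tripped == lone, strict_decode_ok)
```

Under these assumptions the default handler takes the strict line in both
directions, and the round trip only succeeds when the caller asks for it
explicitly, so nothing is accepted silently.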

