[I18n-sig] Re: Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 07:54:22 +0200

> This is wrong.  It is a bug to encode a non-BMP character with six
> bytes by pretending that the (surrogate) values used in the UTF-16
> representation are BMP characters and encoding the character as though
> it was a string consisting of that character.  It is also a bug to
> interpret such a six-byte sequence as a single character.  This was
> clarified in Unicode 3.1.

It seems to be unclear to many, including myself, what exactly was
clarified with Unicode 3.1. Where exactly does it say that processing
a six-byte two-surrogates sequence as a single character is
non-conforming? And what exactly does it say the conforming behaviour
should be?
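
For concreteness, here is a small sketch of the two byte sequences under
discussion. It assumes a present-day Python 3 interpreter, whose
"surrogatepass" error handler postdates this message; the helper names
utf8_direct and utf8_via_surrogates are made up for illustration and are
not part of any codec API.

```python
def utf8_direct(ch):
    """Encode one non-BMP character directly: a four-byte UTF-8 sequence."""
    return ch.encode("utf-8")


def utf8_via_surrogates(ch):
    """Build the six-byte form: split the character into its UTF-16
    surrogate pair, then encode each surrogate as if it were a BMP
    character (three bytes each)."""
    offset = ord(ch) - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    # "surrogatepass" makes Python 3 emit the bytes instead of raising.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


ch = "\U00010000"                     # first character outside the BMP
print(utf8_direct(ch).hex())          # four bytes
print(utf8_via_surrogates(ch).hex())  # six bytes
```

A strict Python 3 decoder rejects the six-byte form with a
UnicodeDecodeError; that is the behaviour of that one implementation and
does not by itself settle what Unicode 3.1 requires.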

> Personally, I think that the codecs should report an error in the
> appropriate fashion when presented with a python unicode string which
> contains values that are not allowed, such as lone surrogates.  

Other people have read Unicode 3.1 and come to the conclusion that it
mandates that implementations accept such a character...
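
As an illustration of the error-reporting behaviour proposed in the quoted
text, the sketch below shows how a present-day Python 3 UTF-8 codec treats
a lone surrogate. This is an assumption about a later interpreter, not a
description of the codecs as they stood when this was written, and
can_encode_utf8 is a hypothetical helper name.

```python
def can_encode_utf8(text):
    """Return True if the strict UTF-8 codec accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        # Python 3 raises here for lone surrogates such as U+D800.
        return False
    return True


print(can_encode_utf8("abc"))         # ordinary BMP text
print(can_encode_utf8("\ud800"))      # lone high surrogate
print(can_encode_utf8("\U00010000"))  # genuine non-BMP character
```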