[I18n-sig] Re: How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 01:22:27 +0200


> Do we permit such a sequence to be held internally as a "Unicode string"?
> Is u"\udc00" legal in source code or should Python throw a syntax error?

I think it shouldn't be legal. If we disallow it, we should
a) simultaneously disallow unichr(0xDC00)
b) allow \U00010000, and unichr(0x10000), which would both give strings
   with two Py_UNICODE values inside (leaving aside the question of what
   len() of such a string would return).
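
To illustrate (b), here is a minimal sketch of how a code point such as
U+10000 splits into two 16-bit values. It is written for present-day
Python 3, not the 2001 interpreter, and the helper names
to_surrogate_pair and is_surrogate are made up for this example; they
are not part of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def is_surrogate(code_unit):
    """True for any value in the surrogate range, paired or not."""
    return 0xD800 <= code_unit <= 0xDFFF


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
```

Under this scheme 0xDC00 on its own is only ever the second half of a
pair, which is why (a) and (b) go together.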

> We *do* need to consider UTF encodings, because Unicode *expressly*
> allows decoding UTF sequences that become unpaired surrogates, or
> other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given
> that Python supports Unicode, not ISO 10646, we must IMO permit such
> sequences in our internal representation.

I think the Unicode standard is in error here (or somebody is
misinterpreting it). It has happened before: Unicode 2.0 insisted
that the internal representation of a Unicode character MUST be
16 bits wide, and found some funny wording to mark a 32-bit wchar_t
as not strictly compliant, but acceptable. Unicode 3.1 has finally
revised this wrong view.

> It follows that we should stop worrying about these irregular values
> -- it's less programming that way. Unicode 3.1 will create enough
> extra programming as it is, because we now have variable-length
> characters again -- just what Unicode was going to save us from :-(

We wouldn't if we could widen Py_UNICODE to 32 bits...
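
A rough way to see the difference, again sketched in present-day
Python 3 with hypothetical helper names (utf16_units, utf32_units):
count how many fixed-size units a string needs at each width. With
16-bit units a character outside the BMP takes two units; with 32-bit
units every character takes exactly one.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf32_units(text):
    """Number of 32-bit code units needed to store text as UTF-32."""
    return len(text.encode("utf-32-le")) // 4


sample = "a" + chr(0x10000)   # one BMP character, one supplementary character
print(utf16_units(sample), utf32_units(sample))
```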

Regards,
Martin