[I18n-sig] Re: How does Python Unicode treat surrogates?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 01:22:27 +0200
> Do we permit such a sequence to be held internally as a "Unicode string"?
> Is u"\udc00" legal in source code or should Python throw a syntax error?
I think it shouldn't be legal. If we disallow it, we should
a) simultaneously disallow unichr(0xDC00)
b) allow \U00010000 and unichr(0x10000), which would both give strings
with two Py_UNICODE values inside (leaving open the question of what
len() of such a string would give).
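[For readers following this thread today: a quick sketch in modern Python 3,
where unichr became chr and all builds use a wide internal representation,
shows how these two questions were eventually settled. This is an
illustration added for context, not part of the original discussion.]

```python
# A lone surrogate may exist inside a str object, but the strict
# UTF-8 codec refuses to encode it.
s = chr(0xDC00)              # lone low surrogate
assert len(s) == 1
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    print("unpaired surrogate rejected by strict codec")

# \U00010000 and chr(0x10000) produce the same one-character string,
# so len() of a non-BMP character is 1 on today's builds.
t = '\U00010000'
assert t == chr(0x10000)
assert len(t) == 1
```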
> We *do* need to consider UTF encodings, because Unicode *expressly*
> allows decoding UTF sequences that become unpaired surrogates, or
> other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given
> that Python supports Unicode, not ISO 10646, we must IMO permit such
> sequences in our internal representation.
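[Historical note for archive readers: Python 3 eventually took a middle
road here. Strict codecs reject such "not 100% valid" sequences, but an
explicit error handler, surrogatepass (available since Python 3.1), lets
a program round-trip them deliberately. A small sketch:]

```python
# 0xED 0xB0 0x80 is the UTF-8-style byte sequence for the lone
# surrogate U+DC00; the strict decoder rejects it...
data = b'\xed\xb0\x80'
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    print("strict decoder rejects an unpaired surrogate")

# ...but surrogatepass decodes and re-encodes it losslessly.
s = data.decode('utf-8', 'surrogatepass')
assert s == '\udc00'
assert s.encode('utf-8', 'surrogatepass') == data
```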
I think the Unicode standard is in error here (or somebody is
misinterpreting it). It has happened before: Unicode 2.0 strongly
believed that the internal representation of a Unicode character MUST
be 16 bits, and found some funny wording to mark a 32-bit wchar_t as
not strictly compliant, but acceptable. Unicode 3.1 has finally
revised this wrong view.
> It follows that we should stop worrying about these irregular values
> -- it's less programming that way. Unicode 3.1 will create enough
> extra programming as it is, because we now have variable-length
> characters again -- just what Unicode was going to save us from :-(
We wouldn't have them if we could widen Py_UNICODE to 32 bits...
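[Illustration added for archive readers: the trade-off Martin describes
is visible by comparing the two encodings directly. UTF-32 spends one
fixed-width unit per code point, while UTF-16 needs a surrogate pair for
anything above U+FFFF:]

```python
c = '\U00010000'  # first code point beyond the BMP

# One 32-bit unit in UTF-32: fixed-width, no surrogates needed.
assert len(c.encode('utf-32-le')) // 4 == 1

# Two 16-bit units in UTF-16: the surrogate pair D800 DC00.
assert len(c.encode('utf-16-le')) // 2 == 2
assert c.encode('utf-16-be') == b'\xd8\x00\xdc\x00'
```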
Regards,
Martin