[Python-ideas] Support Unicode code point notation

Chris Angelico rosuav at gmail.com
Sun Jul 28 08:59:26 CEST 2013


On Sun, Jul 28, 2013 at 4:57 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On 28/07/13 10:30, Andrew Barnert wrote:
>
>> Unicode could go past 10FFFF without dropping UTF-16, either by adding
>> more surrogate pair ranges, or by adding surrogate triplets. It's really no
>> different from extending UTF-8, which is no problem.
>>
>> The problem is that we have no way to predict how they will extend UTF-16,
>> UTF-8, or code point notation if that ever happens. Assuming that the max
>> length for a code point is six nibbles does sound like assuming nobody will
>> ever need more than 640k characters.
>
>
> The Unicode Consortium formally guarantees stability of the character range
> U+0000 - U+10FFFF.
>
> http://www.unicode.org/faq/utf_bom.html#utf16-6

And to add to this: surrogate triplets would break one of the
fundamentals of UTF-16, namely that it is self-synchronizing. You can
look at any single 16-bit code unit and know whether it's a lead
surrogate, a trail surrogate, or a complete BMP character. (Obviously
if you write to a file or other byte stream, you need some out-of-band
way to synchronize on bytes; that's a separate problem.) So there's
unlikely ever to be a scheme that extends UTF-16 to more characters.
UTF-8 can in theory handle longer codes (and some encoders simply
apply the same bit-packing technique to numbers larger than 10FFFF, as
we've already seen).
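
To make the synchronizability point concrete, here's a rough Python
sketch. The names to_units, classify and resync are mine, made up for
illustration, and the whole thing assumes well-formed UTF-16 input:

```python
import struct

def to_units(s):
    """Encode a str as a list of 16-bit UTF-16 code units (no BOM)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def classify(unit):
    """Classify one 16-bit code unit by its value alone."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"
    return "bmp"

def resync(units, i):
    """Return the first character boundary at or after index i."""
    # Landing on a trail surrogate means we are in the middle of a pair,
    # so the next character starts one unit later. Anything else is
    # already the start of a character.
    if i < len(units) and classify(units[i]) == "trail":
        return i + 1
    return i

units = to_units("a\U0001F600b")      # 'a', one astral character, 'b'
print([classify(u) for u in units])   # ['bmp', 'lead', 'trail', 'bmp']
print(resync(units, 2))               # 3
```

Neither function needs any context beyond the one code unit it's
looking at, which is the property I'm talking about.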

The only way to go past 10FFFF would be to declare UTF-16 a flawed
system, just as UCS-2 is now: one that can encode only the first 17
planes of Unicode, the way UCS-2 can encode only the first. I doubt
it'll ever happen, though, as there's no need for more space.
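
For the record, the 10FFFF ceiling falls straight out of the surrogate
pair arithmetic. A quick sketch (decode_pair is a name I made up, not
a stdlib function):

```python
def decode_pair(lead, trail):
    """Combine a lead and a trail surrogate into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of payload; pairs start at U+10000.
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

ucs2_max = 0xFFFF                         # UCS-2: one code unit, BMP only
utf16_max = decode_pair(0xDBFF, 0xDFFF)   # the largest possible pair
print(hex(utf16_max))                     # 0x10ffff
print((utf16_max >> 16) + 1)              # 17 planes of 64K code points
```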

ChrisA

