[Python-ideas] Support Unicode code point notation

Sun Jul 28 05:43:39 CEST 2013

On 28/07/13 09:14, Greg Ewing wrote:
> Steven D'Aprano wrote:
>> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
>
> They do *now*, but we can't be sure that they will stay that
> way in the future.

Yes we can. The Unicode Consortium have guaranteed that Unicode will never be extended past code point U+10FFFF.

I quote:

Q: Will UTF-16 ever be extended to more than a million characters?

A: No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111).

http://www.unicode.org/faq/utf_bom.html#utf16-6
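
As a quick check of that limit in Python itself, here is a minimal sketch. It assumes Python 3.3 or later, where sys.maxunicode is always 0x10FFFF; the names MAX_CODE_POINT, last and beyond_limit_error are only illustrative.

```python
import sys

# The largest code point, U+10FFFF, is 1,114,111 in decimal.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT)                    # 1114111
print(sys.maxunicode == MAX_CODE_POINT)  # True on Python 3.3 and later

# chr() accepts the last code point but rejects anything beyond it.
last = chr(MAX_CODE_POINT)
beyond_limit_error = None
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as exc:
    beyond_limit_error = str(exc)
print(beyond_limit_error)
```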

Supporting some hypothetical "Super-hyper-mega-Code" in 2035 will be as big a change as adding Unicode in the first place. It will probably require a PEP :-)

[...]
>> I'd like to be able to tell people:
>>
>> "To enter a Unicode code point in a string, put a backslash in front of it."
>>
>> instead of telling them to count the number of hex digits,
>
> But they're *still* going to have to count hex digits, and pad
> to 6 if it happens to be followed by a problematic character.

Most uses of hex escapes aren't followed by another hex digit: there are in excess of a million Unicode code points, and fewer than 50 are hex digits (fewer than 30 if you exclude East-Asian full-width forms). To return to the example that keeps being given, if you're writing Ethiopian text, I don't think it is actually very likely that you will want to follow ETHIOPIC SYLLABLE SEE by a Latin digit 5 with no separator between them. Yes, it "might" happen, but there are trivial ways to deal with that, in no particular order:

- pad the code point to six digits

- don't use \U+, use a fixed-width \u or \U escape

- use string concatenation '\U+1234' '5'

- use string substitutions (% or format or $ templates).
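
To make those workarounds concrete, here is a small sketch that uses only escapes Python already supports. The proposed \U+ form does not exist, so the "pad to six digits" option is stood in for by the existing eight-digit \U escape; the variable names are just for illustration.

```python
from string import Template

# U+1234 is ETHIOPIC SYLLABLE SEE; the goal is that character followed by "5".
target = "\N{ETHIOPIC SYLLABLE SEE}" + "5"

padded = "\U000012345"        # \U consumes exactly 8 hex digits, then a literal "5"
fixed_width = "\u12345"       # \u consumes exactly 4 hex digits, then a literal "5"
concatenated = "\u1234" "5"   # adjacent string literals are joined by the compiler
percent = "%s5" % "\u1234"
formatted = "{}5".format("\u1234")
templated = Template("${ch}5").substitute(ch="\u1234")

results = [padded, fixed_width, concatenated, percent, formatted, templated]
print(all(s == target for s in results))  # True
```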

> If we're going to introduce something new, we might as well
> design it not to have silly, awkward properties like that.
>
> The Ruby \U{...} syntax has the following advantages:
>
> * Very clear, not prone to editing errors
> * No fixed limit on number of digits
> * Extends easily to multiple code points
> * Can optionally accept U+ for those who like that
> * Precedent exists in at least one other language

As I said earlier, if someone wants to champion that idea, I won't object.

> Or we could invent something of our own, such as using another
> backslash as a delimiter:
>
>     \U+1234\
>
> Multiple characters could be written as:
>
>     \U+1234+5678+9abc\
>

Another suggestion which was made is:

\N{U+xxxx}

(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:

\N{U+xxxx U+yyyy U+zzzzz}

or

\N{U+xxxx yyyy zzzzz}
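
To show how the body of such an escape might be interpreted, here is a rough runtime sketch. It is not an implementation of the proposal: expand_code_points is a hypothetical helper, and treating the U+ prefix as optional on every token (including the first) is an assumption that goes slightly beyond the two forms above.

```python
import re

# One token: an optional "U+" prefix followed by 1 to 6 hex digits.
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-Fa-f]{1,6})")


def expand_code_points(body):
    """Hypothetical helper: turn 'U+1234 5678' into the matching string."""
    chars = []
    for token in body.split():
        match = _CODE_POINT.fullmatch(token)
        if match is None:
            raise ValueError("not a code point: %r" % token)
        value = int(match.group(1), 16)
        if value > 0x10FFFF:
            # Six hex digits can exceed the Unicode limit, so check explicitly.
            raise ValueError("code point out of range: %r" % token)
        chars.append(chr(value))
    return "".join(chars)


print(expand_code_points("U+1234 U+0035") == "\u1234" "5")  # True
```

Because the code points are separated by spaces inside the braces, a following hex digit can never be swallowed by the escape, which is the problem the delimiter is meant to solve.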

-- 
Steven