[Python-ideas] Support Unicode code point notation

Greg Ewing greg.ewing at canterbury.ac.nz
Sun Jul 28 01:14:50 CEST 2013


Steven D'Aprano wrote:
> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code 
> points go up to hex 10FFFF,

They do *now*, but we can't be sure that they will stay that
way in the future.

This isn't a problem for the U+XXXX notation in informal usage,
since it's usually written with surrounding whitespace or
punctuation that makes it clear where the digits end. But the
\U+XXXX syntax as currently proposed would bake in an absolute
6-digit limit that's impossible to ever extend.

> I'd like to be able to tell people:
> 
> "To enter a Unicode code point in a string, put a backslash in front of 
> it."
> 
> instead of telling them to count the number of hex digits,

But they're *still* going to have to count hex digits, and pad
to 6 if the code point happens to be followed by a character
that is itself a valid hex digit.
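
To make the padding problem concrete, here is a rough sketch. The
decoder is hypothetical: it assumes the \U+ escape greedily takes
between 4 and 6 hex digits, which is only my reading of the proposal,
not anything that exists in Python.

```python
import re

# Hypothetical rule: "\U+" followed by 4 to 6 hex digits, matched greedily.
_ESCAPE = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})")

def decode_greedy(text):
    """Replace each \\U+XXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The "c" of "cole" is a hex digit, so it is swallowed by the escape:
# the result starts with U+0E9C, not U+00E9.
swallowed = decode_greedy(r"\U+00E9cole")

# Padding to the full 6 digits is the only way to stop the match early.
padded = decode_greedy(r"\U+0000E9cole")   # "école"
```

So the writer has to notice that the next character is a hex digit
and then count zeros, which is exactly the counting we were trying
to avoid.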

If we're going to introduce something new, we might as well
design it not to have silly, awkward properties like that.

A brace-delimited \U{...} syntax, modelled on Ruby's \u{...}, has the
following advantages:

* Very clear, not prone to editing errors
* No fixed limit on number of digits
* Extends easily to multiple code points
* Can optionally accept U+ for those who like that
* Precedent exists in at least one other language
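
As a rough illustration of those points, a brace-delimited escape
could be decoded along these lines. The function names, the
whitespace separator between code points, and the optional U+ prefix
are all assumptions made for the sketch, not an agreed design.

```python
import re

# Hypothetical syntax: \U{...} holding one or more whitespace-separated
# hex code points, each optionally prefixed with "U+".
_BRACED = re.compile(r"\\U\{([^{}]*)\}")

def _decode_group(match):
    chars = []
    for token in match.group(1).split():
        if token.upper().startswith("U+"):
            token = token[2:]              # drop the optional U+ prefix
        chars.append(chr(int(token, 16)))  # any number of hex digits
    return "".join(chars)

def decode_braced(text):
    """Replace each \\U{...} escape with the characters it names."""
    return _BRACED.sub(_decode_group, text)

word = decode_braced(r"\U{E9}cole")             # "école", no padding needed
greeting = decode_braced(r"\U{48 65 6C 6C 6F}") # "Hello"
```

The closing brace ends the escape, so a following hex digit is never
ambiguous and nobody has to count anything.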

Or we could invent something of our own, such as using another
backslash as a delimiter:

    \U+1234\

Multiple characters could be written as:

    \U+1234+5678+9abc\
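
A decoder for that variant might look something like this. Again it
is only a sketch of a syntax that doesn't exist, and the names are
made up for the example.

```python
import re

# Hypothetical syntax: \U, then one or more "+HEX" groups, then a
# closing backslash.
_DELIMITED = re.compile(r"\\U((?:\+[0-9A-Fa-f]+)+)\\")

def _decode_run(match):
    # "+1234+5678" splits into ["", "1234", "5678"]; skip the empty head.
    codes = match.group(1).split("+")[1:]
    return "".join(chr(int(code, 16)) for code in codes)

def decode_delimited(text):
    """Replace each \\U+XXXX+YYYY\\ escape with the characters it names."""
    return _DELIMITED.sub(_decode_run, text)

word = decode_delimited("\\U+E9\\cole")               # "école"
greeting = decode_delimited("\\U+48+65+6c+6c+6f\\")   # "Hello"
```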

-- 
Greg


