[Python-ideas] Support Unicode code point notation
Greg Ewing
greg.ewing at canterbury.ac.nz
Sun Jul 28 01:14:50 CEST 2013
Steven D'Aprano wrote:
> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code
> points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that
way in the future.
This isn't a problem for the U+XXXX notation in informal usage,
since it's usually written with surrounding whitespace or
punctuation that makes it clear where the digits end. But the
\U+XXXX syntax as currently proposed would bake in an absolute
6-digit limit that's impossible to ever extend.
> I'd like to be able to tell people:
>
> "To enter a Unicode code point in a string, put a backslash in front of
> it."
>
> instead of telling them to count the number of hex digits,
But they're *still* going to have to count hex digits, and pad
to 6 if the code point happens to be followed by a character
that could itself be read as a hex digit.
If we're going to introduce something new, we might as well
design it not to have silly, awkward properties like that.
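To make the padding problem concrete, here is a minimal sketch. It assumes the proposed \U+XXXX escape greedily consumes between one and six hex digits, which is only my reading of the proposal and not a confirmed specification. The function name expand_greedy is hypothetical.

```python
import re

# Assumption: the escape is "\U+" followed by 1 to 6 hex digits, read greedily.
_GREEDY = re.compile(r"\\U\+([0-9A-Fa-f]{1,6})")

def expand_greedy(text):
    """Replace each \\U+X..XXXXXX escape with the character it names."""
    # chr() raises ValueError for values above 0x10FFFF; not handled here.
    return _GREEDY.sub(lambda m: chr(int(m.group(1), 16)), text)

# Intended: U+00E9 followed by "cole". The "c" is a hex digit,
# so the escape is read as U+0E9C and the word is mangled.
mangled = expand_greedy(r"\U+E9cole")

# Padding to six digits is the only way to stop the greedy read.
padded = expand_greedy(r"\U+0000E9cole")
```

Under that assumption the unpadded form yields U+0E9C followed by "ole", and only the padded form yields the intended word.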
The Ruby \u{...} syntax has the following advantages:
* Very clear, not prone to editing errors
* No fixed limit on number of digits
* Extends easily to multiple code points
* Can optionally accept U+ for those who like that
* Precedent exists in at least one other language
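As a rough illustration of how a braced escape could be expanded, here is a sketch under stated assumptions: code points inside the braces are separated by whitespace, an optional U+ prefix is allowed on each, and either case of the escape letter is accepted. The function name expand_braced is hypothetical; this is not an existing Python feature.

```python
import re

# Assumptions: "\U{...}" or "\u{...}", whitespace-separated code points,
# each optionally prefixed with "U+".
_BRACED = re.compile(r"\\[uU]\{([^{}]*)\}")
_TOKEN = re.compile(r"(?:U\+)?([0-9A-Fa-f]+)")

def expand_braced(text):
    """Replace each braced escape with the characters it names."""
    def repl(match):
        chars = []
        for token in match.group(1).split():
            m = _TOKEN.fullmatch(token)
            if m is None:
                raise ValueError(f"bad code point {token!r}")
            # No digit-count limit; chr() rejects values above 0x10FFFF.
            chars.append(chr(int(m.group(1), 16)))
        return "".join(chars)
    return _BRACED.sub(repl, text)

word = expand_braced(r"\U{E9}cole")          # closing brace ends the digits
greeting = expand_braced(r"\U{U+48 U+69}!")  # two code points, U+ accepted
```

Because the closing brace marks where the digits end, no padding is needed and the number of digits is limited only by what chr() accepts.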
Or we could invent something of our own, such as using another
backslash as a delimiter:
\U+1234\
Multiple characters could be written as:
\U+1234+5678+9abc\
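A sketch of how the backslash-delimited form might be expanded, assuming exactly the shape shown above: \U+, one or more hex numbers joined by +, then a closing backslash. The function name expand_delimited is hypothetical.

```python
import re

# Assumption: "\U+" HEX ("+" HEX)* "\" with no limit on digits per code point.
_DELIMITED = re.compile(r"\\U\+([0-9A-Fa-f]+(?:\+[0-9A-Fa-f]+)*)\\")

def expand_delimited(text):
    """Replace each \\U+...\\ escape with the characters it names."""
    def repl(match):
        # chr() raises ValueError for values above 0x10FFFF.
        return "".join(chr(int(h, 16)) for h in match.group(1).split("+"))
    return _DELIMITED.sub(repl, text)

three = expand_delimited("\\U+1234+5678+9abc\\")  # the example above
word = expand_delimited(r"\U+E9\cole")            # delimiter ends the digits
```

One thing the sketch does not address is how the closing backslash would interact with an ordinary escape such as \n if the next character happens to be a letter like n.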
--
Greg