[Python-ideas] Support Unicode code point notation

Alexander Belopolsky alexander.belopolsky at gmail.com
Fri Aug 2 00:58:49 CEST 2013


On Thu, Aug 1, 2013 at 6:09 PM, Terry Reedy <tjreedy at udel.edu> wrote:

> Why would someone write 'control-' instead of 'U+'?


Because this is the recommended way to form the code-point labels:

"For each code point type without character names, code point labels are
constructed by using a lowercase prefix derived from the code point type,
followed by a hyphen-minus and then a 4- to 6-digit hexadecimal
representation of the code point."

"To avoid any possible confusion with actual, non-null Name property
values, constructed Unicode code point labels are often displayed between
angle brackets: <control-0009>, <noncharacter-FFFF>, and so on. This
convention is used consistently in the data files for the Unicode Character
Database."

"A constructed code point label is distinguished from the designation of
the code point itself (for example, “U+0009” or “U+FFFF”), which is also a
unique identifier, as described in Appendix A, Notational Conventions."
<http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf>
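
As a rough illustration of the construction rule quoted above, here is a minimal sketch in Python 3. The function name code_point_label is hypothetical (nothing like it exists in unicodedata), and the mapping from general category to prefix is my reading of the quoted text, not an existing API. The "reserved" result depends on which code points are unassigned in the Unicode database shipped with the interpreter.

```python
import unicodedata

def code_point_label(cp):
    """Return a constructed label such as 'control-0009', or None.

    Hypothetical helper: None means the code point belongs to a type
    that has character names, so no label is constructed for it.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        prefix = "control"
    elif category == "Cs":
        prefix = "surrogate"
    elif category == "Co":
        prefix = "private-use"
    elif category == "Cn":
        # Noncharacters are U+FDD0..U+FDEF plus the last two code
        # points of every plane (U+xxFFFE and U+xxFFFF).
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        prefix = "noncharacter" if is_nonchar else "reserved"
    else:
        return None
    # Lowercase prefix, hyphen-minus, then 4 to 6 hexadecimal digits.
    return "%s-%04X" % (prefix, cp)
```

With this sketch, code_point_label(0x09) gives 'control-0009' and code_point_label(0xFFFF) gives 'noncharacter-FFFF', matching the two examples in the quoted passage.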

I would rather see unicodedata.lookup() extended to accept code-point
labels than "the designation of the code point itself." The same
applies to the \N escape: I would rather see \N{control-NNNN} or
\N{surrogate-NNNN} in string literals than some mysterious \N{U+NNNN}.
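
To make the lookup() half of this concrete, here is a minimal sketch of what such an extension might look like as a wrapper in Python 3. The name lookup_with_labels is hypothetical and this is not how unicodedata.lookup() behaves today; the choice to reject a label whose prefix does not match the code point's type is also my assumption.

```python
import re
import unicodedata

# Lowercase prefix, hyphen-minus, 4 to 6 uppercase hexadecimal digits.
_LABEL = re.compile(
    r"(control|surrogate|private-use|noncharacter|reserved)-([0-9A-F]{4,6})\Z"
)

# General category expected for each label prefix (assumed mapping).
_CATEGORY = {
    "control": "Cc",
    "surrogate": "Cs",
    "private-use": "Co",
    "noncharacter": "Cn",
    "reserved": "Cn",
}

def lookup_with_labels(name):
    """Hypothetical lookup() that also accepts constructed labels."""
    match = _LABEL.match(name)
    if match is None:
        # Not a label: fall back to the ordinary character-name lookup.
        return unicodedata.lookup(name)
    prefix, digits = match.groups()
    cp = int(digits, 16)
    if cp > 0x10FFFF:
        raise KeyError("code point out of range in label %r" % name)
    char = chr(cp)
    if unicodedata.category(char) != _CATEGORY[prefix]:
        raise KeyError("label %r does not match the code point type" % name)
    return char
```

Under these assumptions lookup_with_labels("control-0009") returns the tab character, ordinary names such as "LATIN SMALL LETTER A" still work, and a mismatched label such as "control-0041" raises KeyError, the same exception lookup() raises for an unknown name.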