New subject: Support Unicode code point labels (Was: notation)

2 Aug 2013

I am starting a new thread to discuss an idea that is orthogonal to Steven
D'Aprano's \U+NNNN proposal.

The Unicode Standard defines five types of code points for which it does
not provide a unique Name property.  These types are: Control, Reserved,
Noncharacter, Private-use and Surrogate.

When a unique descriptive label is required for any such code point, the
standard recommends constructing a label as follows: "For each code point
type without character names, code point labels are constructed by using a
lowercase prefix derived from the code point type, followed by a
hyphen-minus and then a 4- to 6-digit hexadecimal representation of the
code point."
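
As a rough illustration of the rule quoted above, a label could be built from the General_Category that unicodedata.category() reports. This is only a sketch: code_point_label() and is_noncharacter() are hypothetical names, not existing unicodedata functions, and the noncharacter test hard-codes the 66 noncharacter code points (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
import unicodedata

# General_Category values that map directly onto a label prefix.
_PREFIX_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(ch):
    """Hypothetical helper: build '<type-prefix>-NNNN' for an unnamed code point."""
    cp = ord(ch)
    category = unicodedata.category(ch)
    if category in _PREFIX_BY_CATEGORY:
        prefix = _PREFIX_BY_CATEGORY[category]
    elif category == "Cn":
        # Cn covers both noncharacters and unassigned (reserved) code points.
        prefix = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        raise ValueError("code point has a character name, not a label")
    # At least four hex digits, growing to five or six for larger values.
    return "{}-{:04X}".format(prefix, cp)


print(code_point_label(chr(0x0009)))  # control-0009
```

Which code points come out as reserved- depends on the version of the UCD that the running Python ships with.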

I propose adding support for these labels to unicodedata.lookup(), \N{..}
and unicodedata.name() (or unicodedata.label()).

In the previous thread, there was a disagreement on whether invalid labels
(such as reserved-0009 instead of control-0009) should be accepted.  I will
address this in my response to Stephen Turnbull's e-mail below.
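
To make the strict versus lenient question concrete, here is a sketch of a lookup that accepts labels. lookup_with_labels(), its strict parameter and the regular expression are assumptions made for illustration; an actual change would live inside unicodedata.lookup() and the \N{...} parser. The sketch raises KeyError for a mismatched label because that is what unicodedata.lookup() raises today for an unknown name, and it assumes lowercase prefixes and uppercase hex digits.

```python
import re
import unicodedata

_LABEL_RE = re.compile(
    r"(control|reserved|noncharacter|private-use|surrogate)-([0-9A-F]{4,6})"
)


def _code_point_type(cp):
    """Return the label prefix for cp, or None if the code point has a name."""
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return "control"
    if category == "Co":
        return "private-use"
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        return "noncharacter" if nonchar else "reserved"
    return None


def lookup_with_labels(name, strict=True):
    """Hypothetical wrapper: resolve a code point label or a character name."""
    match = _LABEL_RE.fullmatch(name)
    if match is None:
        # Not a label: defer to the existing name lookup.
        return unicodedata.lookup(name)
    prefix, digits = match.groups()
    cp = int(digits, 16)
    if cp > 0x10FFFF:
        raise KeyError("code point out of range: " + name)
    if strict and _code_point_type(cp) != prefix:
        raise KeyError("label does not match code point type: " + name)
    return chr(cp)
```

With strict=True, lookup_with_labels("control-0009") returns the tab character and lookup_with_labels("reserved-0009") raises KeyError; with strict=False both return the tab character.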

Another question is how to add support for generating the labels in a
backward compatible manner.

Currently unicodedata.name() raises ValueError when no name is available
for a code point:
...
...
...
>>> unicodedata.name(chr(0x0009))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

Since unicodedata.name() also accepts a default value, it is unlikely
that users write code like this:

try:
    name = unicodedata.name(x)
except ValueError:
    name = 'U+{:04X}'.format(ord(x))

instead of

name = unicodedata.name(x, '') or 'U+{:04X}'.format(ord(x))

However, the most conservative approach is not to change the behavior of
unicodedata.name() and provide a new function unicodedata.label().

On Fri, Aug 2, 2013 at 3:36 AM, Stephen J. Turnbull wrote:
...
> Alexander Belopolsky writes:
...
>> .. why would you write \N{reserved-NNNN} instead of
>> \uNNNN to begin with?
>
> I wouldn't.  The problem isn't writing "\N{reserved-50000}".  It's
> the other way around: I want to *write* "\N{control-50000}" which
> expresses my intent in Python 3.5 and not have it blow up in Python
> 3.4 which uses an older UCD where U+50000 is unassigned.

"\N{control-50000}" will blow up in every past, present or future Python
version.  Since Unicode 1.1.5, "The General_Category property value Control
(Cc) is immutable: the set of code points with that value will never
change." http://www.unicode.org/policies/stability_policy.html
...
...
>> With the possible exception of reserved-, on a rare occasion when you
>> want to be explicit about the character type, it is useful to be
>> strict.
>
> As explained above, strictness is not backward compatible with older
> versions of the UCD that might be in use in older versions of Python.

This is not an issue for versions of Python that currently exist because
they do not support \N{<type-prefix>-NNNN} syntax at all.   What may happen
if my proposal is accepted is that \N{reserved-50000} will be valid in
Python 3.N but invalid in 3.N+1 for some N > 3.  If this becomes an issue,
we can solve this problem when the time comes.  It is always easier to
relax the rules than to make them stricter.  Yet, I still don't see the
problem.  You can already write

assert unicodedata.category(chr(0x50000)) == 'Cn'

in your code and this will blow up in any future version that uses a UCD
in which U+50000 is assigned.

You can think of "\N{<type-prefix>-NNNN}" as syntactic sugar for "\uNNNN"
followed by an assert.
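
Spelled out by hand, that desugaring might look like the sketch below. checked_char() is a hypothetical helper, not part of unicodedata, and it checks only the General_Category, so it cannot tell reserved- from noncharacter- (both are Cn).

```python
import unicodedata


def checked_char(code_point, expected_category):
    """Return chr(code_point) after checking its General_Category."""
    ch = chr(code_point)
    actual = unicodedata.category(ch)
    if actual != expected_category:
        raise ValueError(
            "U+{:04X} has category {}, expected {}".format(
                code_point, actual, expected_category
            )
        )
    return ch


# Roughly what "\N{control-0009}" would mean under the proposal.
tab = checked_char(0x0009, "Cc")
```

A call such as checked_char(0x50000, "Cn") would start failing once a later UCD assigns U+50000, which is the same behaviour as the assert shown earlier.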
