[Python-ideas] Support Unicode code point labels (Was: notation)

Fri Aug 2 18:42:19 CEST 2013

I am starting a new thread to discuss an idea that is orthogonal to Steven
D'Aprano's \U+NNNN proposal.

The Unicode Standard defines five types of code points for which it does
not provide a unique Name property.  These types are: Control, Reserved,
Noncharacter, Private-use and Surrogate.

When a unique descriptive label is required for any such code point, the
standard recommends constructing a label as follows: "For each code point
type without character names, code point labels are constructed by using a
lowercase prefix derived from the code point type, followed by a
hyphen-minus and then a 4- to 6-digit hexadecimal representation of the
code point."

I propose adding support for these labels to unicodedata.lookup(), \N{..}
and unicodedata.name() (or unicodedata.label()).

In the previous thread, there was a disagreement on whether invalid labels
(such as reserved-0009 instead of control-0009) should be accepted.  I will
address this in my response to Stephen Turnbull's e-mail below.

Another question is how to add support for generating the labels in a
backward compatible manner.

Currently unicodedata.name() raises ValueError when no name is available
for a code point:

>>> unicodedata.name(chr(0x0009))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

Since unicodedata.name() also supports specifying default, it is unlikely
that users write code like this

try:
   name =  unicodedata.name(x)
except ValueError:
   name = 'U+{:04X}'.format(ord(x))

instead of

name = unicodedata.name(x, '') or 'U+{:04X}'.format(ord(x))

However, the most conservative approach is not to change the behavior of
unicodedata.name() and provide a new function unicodedata.label().

On Fri, Aug 2, 2013 at 3:36 AM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:
>
> Alexander Belopolsky writes:
>  > .. why would you write \N{reserved-NNNN} instead of
>  > \uNNNN to begin with?
>
> I wouldn't.  The problem isn't writing "\N{reserved-50000}".  It's
> the other way around: I want to *write* "\N{control-50000}" which
> expresses my intent in Python 3.5 and not have it blow up in Python
> 3.4 which uses an older UCD where U+50000 is unassigned.

 "\N{control-50000}" will blow up in every past, present or future Python
version.  Since Unicode 1.1.5, "The General_Category property value Control
(Cc) is immutable: the set of code points with that value will never
change." <http://www.unicode.org/policies/stability_policy.html>

>  > With the possible exception or reserved-, on a rare occasion when you
>  > want to be explicit about the character type, it is useful to be
>  > strict.
>
> As explained above, strictness is not backward compatible with older
> versions of the UCD that might be in use in older versions of Python.
>

This is not an issue for versions of Python that currently exist because
they do not support \N{<type-prefix>-NNNN} syntax at all.   What may happen
if my proposal is accepted is that \N{reserved-50000} will be valid in
Python 3.N but invalid in 3.N+1 for some N > 3.  If this becomes an issue,
we can solve this problem when the time comes.  It is always easier to
relax the rules than to make them stricter.  Yet, I still don't see the
problem.  You can already write

assert unicodedata.category(chr(0x50000)) == 'Cn'

in your code and this will blow up in any future version that will use UCD
with U+50000 assigned.

You can think of "\N{<type-prefix>-NNNN}" as a syntax sugar for "\uNNNN"
followed by an assert.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20130802/f9f520da/attachment.html>