[Python-ideas] Support Unicode code point labels (Was: notation)
Alexander Belopolsky
alexander.belopolsky at gmail.com
Fri Aug 2 18:42:19 CEST 2013
I am starting a new thread to discuss an idea that is orthogonal to Steven
D'Aprano's \U+NNNN proposal.
The Unicode Standard defines five types of code points for which it does
not provide a unique Name property. These types are: Control, Reserved,
Noncharacter, Private-use and Surrogate.
When a unique descriptive label is required for any such code point, the
standard recommends constructing a label as follows: "For each code point
type without character names, code point labels are constructed by using a
lowercase prefix derived from the code point type, followed by a
hyphen-minus and then a 4- to 6-digit hexadecimal representation of the
code point."
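The construction rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: code_point_label() and is_noncharacter() are hypothetical helpers, not part of unicodedata, and the code point type is inferred from the General_Category of whatever UCD version the running Python ships.

```python
import unicodedata

# Hypothetical helpers for illustration; not part of the unicodedata module.
_PREFIX_BY_CATEGORY = {
    "Cc": "control",
    "Co": "private-use",
    "Cs": "surrogate",
}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp):
    """Return a label such as 'control-0009', or None when the code point
    belongs to none of the five types without character names."""
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        # Unassigned code points are either noncharacters or reserved.
        prefix = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        prefix = _PREFIX_BY_CATEGORY.get(category)
        if prefix is None:
            return None
    # At least four hex digits, growing to five or six as needed.
    return "{}-{:04X}".format(prefix, cp)
```

With this sketch, code_point_label(0x9) gives 'control-0009' and code_point_label(0x10FFFE) gives 'noncharacter-10FFFE'.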
I propose adding support for these labels to unicodedata.lookup(), \N{..}
and unicodedata.name() (or unicodedata.label()).
In the previous thread, there was a disagreement on whether invalid labels
(such as reserved-0009 instead of control-0009) should be accepted. I will
address this in my response to Stephen Turnbull's e-mail below.
Another question is how to add support for generating the labels in a
backward compatible manner.
Currently unicodedata.name() raises ValueError when no name is available
for a code point:
>>> unicodedata.name(chr(0x0009))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
Since unicodedata.name() also accepts a default argument, it is unlikely
that users write code like this:
try:
    name = unicodedata.name(x)
except ValueError:
    name = 'U+{:04X}'.format(ord(x))
instead of
name = unicodedata.name(x, '') or 'U+{:04X}'.format(ord(x))
However, the most conservative approach is not to change the behavior of
unicodedata.name() and provide a new function unicodedata.label().
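For reference, the two idioms can be wrapped in functions and compared. This is a minimal sketch; name_via_exception() and name_via_default() are names made up for this example.

```python
import unicodedata


def name_via_exception(ch):
    # Relies on unicodedata.name() raising ValueError for unnamed code points.
    try:
        return unicodedata.name(ch)
    except ValueError:
        return "U+{:04X}".format(ord(ch))


def name_via_default(ch):
    # Relies on the default argument; the empty string triggers the fallback.
    return unicodedata.name(ch, "") or "U+{:04X}".format(ord(ch))
```

Both return 'U+0009' for a tab character today. If unicodedata.name() began returning labels, neither fallback would run for such code points, which is the kind of behavior change a separate unicodedata.label() would avoid.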
On Fri, Aug 2, 2013 at 3:36 AM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:
>
> Alexander Belopolsky writes:
> > .. why would you write \N{reserved-NNNN} instead of
> > \uNNNN to begin with?
>
> I wouldn't. The problem isn't writing "\N{reserved-50000}". It's
> the other way around: I want to *write* "\N{control-50000}" which
> expresses my intent in Python 3.5 and not have it blow up in Python
> 3.4 which uses an older UCD where U+50000 is unassigned.
"\N{control-50000}" will blow up in every past, present or future Python
version. Since Unicode 1.1.5, "The General_Category property value Control
(Cc) is immutable: the set of code points with that value will never
change." <http://www.unicode.org/policies/stability_policy.html>
> > With the possible exception of reserved-, on a rare occasion when you
> > want to be explicit about the character type, it is useful to be
> > strict.
>
> As explained above, strictness is not backward compatible with older
> versions of the UCD that might be in use in older versions of Python.
>
This is not an issue for versions of Python that currently exist because
they do not support the \N{<type-prefix>-NNNN} syntax at all. What may happen
if my proposal is accepted is that \N{reserved-50000} will be valid in
Python 3.N but invalid in 3.N+1 for some N > 3. If this becomes an issue,
we can solve this problem when the time comes. It is always easier to
relax the rules than to make them stricter. Yet, I still don't see the
problem. You can already write
assert unicodedata.category(chr(0x50000)) == 'Cn'
in your code and this will blow up in any future version that uses a UCD
with U+50000 assigned.
You can think of "\N{<type-prefix>-NNNN}" as syntactic sugar for "\uNNNN"
followed by an assert.
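The "escape followed by an assert" reading can be sketched as a strict lookup function. Everything here is an assumption made for illustration: strict_lookup() and code_point_type() are hypothetical names, the code raises KeyError only because that is what unicodedata.lookup() raises for unknown names, and the type check depends on the UCD version of the running Python.

```python
import string
import unicodedata

# Hypothetical sketch of strict label lookup; not an existing API.
_PREFIXES = ("control", "reserved", "noncharacter", "private-use", "surrogate")
_TYPE_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def code_point_type(cp):
    """Return the label prefix for cp, or None if cp has a character name."""
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return "noncharacter"
        return "reserved"
    return _TYPE_BY_CATEGORY.get(category)


def strict_lookup(label):
    """Resolve a label like 'control-0009', rejecting mismatched prefixes."""
    prefix, _, digits = label.rpartition("-")
    valid_digits = 4 <= len(digits) <= 6 and all(
        c in string.hexdigits for c in digits
    )
    if prefix not in _PREFIXES or not valid_digits:
        raise KeyError("malformed code point label: {!r}".format(label))
    cp = int(digits, 16)
    if cp > 0x10FFFF:
        raise KeyError("code point out of range: {!r}".format(label))
    # This is the "assert" part: the prefix must match the actual type.
    if code_point_type(cp) != prefix:
        raise KeyError("wrong type prefix in label: {!r}".format(label))
    return chr(cp)
```

Under this sketch, strict_lookup('control-0009') returns the tab character, while 'reserved-0009' and 'control-50000' both raise KeyError.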