[Python-ideas] Support Unicode code point labels (Was: notation)

Alexander Belopolsky alexander.belopolsky at gmail.com
Sat Aug 3 22:59:08 CEST 2013


On Sat, Aug 3, 2013 at 10:35 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:


> The problem is that someone will use code written by someone using a
> future version and run it with a past version, and the assert will
> trigger.  I don't see any good reason why it should.  The Unicode
> Standard explicitly specifies how unknown code points should be
> handled.  Raising an exception is not part of that spec.


It looks like we are running into a confusion between code points and their
name property.  I agree that a conforming function that returns the name
for a code point should not raise an exception.  The standard offers two
alternatives: return an empty string or return a generated label.  In this
sense unicodedata.name() is not conforming:

>>> unicodedata.name('\u0009')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

However, it is trivial to achieve conforming behavior:

>>> unicodedata.name('\u0009', '')
''

I propose adding a unicodedata.label() function that will return
'control-0009' in this case.  I think this proposal is fully in line with
the standard.
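
A minimal sketch of what such a function might look like, written in pure
Python on top of the existing unicodedata API.  The name label() and its
exact behavior are assumptions made for illustration; nothing like it
exists in the module today.  The prefixes follow the code point label
convention in the standard (control, private-use, surrogate, noncharacter,
reserved):

```python
import unicodedata

# Hypothetical helper: NOT part of the unicodedata module.
# Maps general categories to the label prefixes used by the Unicode Standard.
_LABEL_PREFIX = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def label(char):
    """Return the character name if there is one, else a code point label."""
    try:
        return unicodedata.name(char)
    except ValueError:
        pass  # no name property; fall through and generate a label

    cp = ord(char)
    category = unicodedata.category(char)
    if category in _LABEL_PREFIX:
        prefix = _LABEL_PREFIX[category]
    elif category == "Cn":
        # Noncharacters are U+FDD0..U+FDEF plus the last two code points
        # of every plane; everything else in Cn is merely unassigned.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        prefix = "noncharacter" if is_nonchar else "reserved"
    else:
        # An assigned character whose name this database cannot produce.
        raise ValueError("no name or label for U+%04X" % cp)
    return "%s-%04X" % (prefix, cp)


print(label("\u0009"))   # control-0009
print(label("A"))        # LATIN CAPITAL LETTER A
print(label("\ufdd0"))   # noncharacter-FDD0
```

Which code points come out as reserved-NNNN depends on the Unicode
database shipped with the running Python, which is exactly the version
sensitivity discussed below.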

For the inverse operation, unicodedata.lookup(), I don't see anything in
the standard that precludes raising an exception on an unknown name.  If
that were a problem, we would have it already.

In Python >=3.2:

>>> unicodedata.lookup('BELL')
'🔔'

But in Python 3.1:

>>> unicodedata.lookup('BELL')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'BELL'"
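
Code that has to run on several versions can already cope with this by
catching the KeyError.  A small sketch follows; the helper name
lookup_or_default() is made up for this example, and only
unicodedata.lookup() itself is real:

```python
import unicodedata


def lookup_or_default(name, default=None):
    """Like unicodedata.lookup(), but return default for unknown names."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # The name is not in the Unicode database of this Python version.
        return default


print(lookup_or_default("LATIN SMALL LETTER A"))         # a
print(lookup_or_default("NO SUCH CHARACTER NAME", "?"))  # ?
```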

The only potential problem that I see with my proposal is that it is
reasonable to expect that if '\N{whatever}' works in one version it will
work the same in all versions after that.  My proposal will break this
expectation only in the case of '\N{reserved-NNNN}'.  Once a code point
NNNN is assigned, '\N{reserved-NNNN}' will become a syntax error.
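
To illustrate the version dependence, here is a rough run-time analogue of
resolving a reserved-NNNN label.  The function lookup_reserved() is
hypothetical, and it raises KeyError where the proposed '\N{...}' escape
would produce a syntax error; it is a sketch of the idea, not of the
eventual implementation:

```python
import unicodedata


def lookup_reserved(label):
    """Resolve 'reserved-NNNN' only while code point NNNN is unassigned."""
    prefix, sep, hexdigits = label.rpartition("-")
    if prefix != "reserved" or not sep:
        raise KeyError("not a reserved label: %r" % label)
    try:
        char = chr(int(hexdigits, 16))
    except ValueError:
        # Not hexadecimal, or outside the range of valid code points.
        raise KeyError("bad code point in label: %r" % label)
    cp = ord(char)
    is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    # Category Cn covers both noncharacters and unassigned code points;
    # only the latter are "reserved".
    if unicodedata.category(char) != "Cn" or is_nonchar:
        raise KeyError("U+%04X is not reserved in this database" % cp)
    return char


try:
    lookup_reserved("reserved-0041")   # U+0041 is 'A', long since assigned
except KeyError as exc:
    print("rejected:", exc)
```

The same label can therefore resolve on an older Python and fail on a newer
one whose database has assigned the code point.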

If you agree that this is the only problematic case, let's focus on it.  I
cannot think of any reason to deliberately use reserved characters other
than to stress-test your Unicode-handling software.  In this application,
you probably want to see an error once NNNN is assigned because your tests
will no longer cover the unassigned character case.

Can you suggest any other use?