[Python-ideas] Support Unicode code point labels (Was: notation)

Sun Aug 4 01:57:13 CEST 2013

On 03/08/2013 21:59, Alexander Belopolsky wrote:
>
> On Sat, Aug 3, 2013 at 10:35 AM, Stephen J. Turnbull
> <stephen at xemacs.org> wrote:
>
>     The problem is that someone will use code written by someone using a
>     future version and run it with a past version, and the assert will
>     trigger.  I don't see any good reason why it should.  The Unicode
>     Standard explicitly specifies how unknown code points should be
>     handled.  Raising an exception is not part of that spec.
>
>
> It looks like we are running into a confusion between code points and
> their name property.  I agree that a conforming function that returns
> the name for a code point should not raise an exception.  The standard
> offers two alternatives: return an empty string or return a generated
> label.  In this sense unicodedata.name() is not conforming:
>
>  >>> unicodedata.name('\u0009')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> ValueError: no such name
>
> However, it is trivial to achieve conforming behavior:
>
>  >>> unicodedata.name('\u0009', '')
> ''
>
> I propose adding a unicodedata.label() function that will return
> 'control-0009' in this case.  I think this proposal is fully in line
> with the standard.
>
> For the inverse operation, unicodedata.lookup(), I don't see anything in
> the standard that precludes raising an exception on an unknown name.  If
> that was a problem, we would have it already.
>
> In Python >=3.2:
>
>  >>> unicodedata.lookup('BELL')
> '🔔'
>
> But in Python 3.1:
>
>  >>> unicodedata.lookup('BELL')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> KeyError: "undefined character name 'BELL'"
>
> The only potential problem that I see with my proposal is that it is
> reasonable to expect that if '\N{whatever}' works in one version it will
> work the same in all versions after that.  My proposal will break this
> expectation only in the case of '\N{reserved-NNNN}'.  Once code point
> NNNN is assigned, '\N{reserved-NNNN}' will become a syntax error.
>
I think that's to be expected. When a codepoint is assigned, it's no
longer "reserved". It's just unfortunate that it'll break the code.
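
For reference, here is a rough sketch of what the proposed label()
function might do. This is only an illustration: the name label() and
its behaviour are assumptions based on the proposal quoted above and on
the code point label types listed in the Unicode Standard (control,
private-use, surrogate, noncharacter, reserved); nothing like it exists
in unicodedata today.

```python
import unicodedata

def label(ch):
    """Hypothetical helper: return the character's name if it has one,
    otherwise a generated code point label such as 'control-0009'."""
    name = unicodedata.name(ch, '')
    if name:
        return name
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat == 'Cc':
        kind = 'control'
    elif cat == 'Co':
        kind = 'private-use'
    elif cat == 'Cs':
        kind = 'surrogate'
    elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        # U+FDD0..U+FDEF and every code point ending in FFFE or FFFF
        kind = 'noncharacter'
    else:
        # Unassigned in the Unicode version this Python was built with
        kind = 'reserved'
    return '%s-%04X' % (kind, cp)
```

With this sketch, label('\u0009') gives 'control-0009', matching the
example in the proposal. Which code points come out as "reserved"
depends on the Unicode database shipped with the interpreter, which is
exactly why the result can change between versions.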

> If you agree that this is the only problematic case, let's focus on it.
> I cannot think of any reason to deliberately use reserved characters
> other than to stress-test your Unicode-handling software.  In this
> application, you probably want to see an error once NNNN is assigned
> because your tests will no longer cover the unassigned character case.
>
Actually putting '\N{reserved-NNNN}' in your code would be a bad idea
because at some point in the future you won't be able to run the code
at all!
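
If the aim is only to stress-test with an unassigned code point, one
way to avoid hard-coding a label is to look for a reserved code point
at run time. A minimal sketch, assuming Python 3.3 or later; the helper
names is_noncharacter() and find_reserved() are made up for this
example:

```python
import sys
import unicodedata

def is_noncharacter(cp):
    # U+FDD0..U+FDEF and every code point ending in FFFE or FFFF
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_reserved(start=0):
    """Return the first code point >= start that this interpreter's
    Unicode database treats as unassigned (category Cn) and that is
    not a noncharacter."""
    for cp in range(start, sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == 'Cn' and not is_noncharacter(cp):
            return cp
    raise LookupError('no reserved code point found')

# Build the test character from its number instead of a \N{...} escape.
sample = chr(find_reserved())
assert unicodedata.name(sample, '') == ''
```

Because the character is built with chr() rather than written as
'\N{reserved-NNNN}', the source still compiles after a later Unicode
version assigns that code point; the search just returns a different
one.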

> Can you suggest any other use?
>