[Python-ideas] Support Unicode code point labels (Was: notation)
MRAB
python at mrabarnett.plus.com
Sun Aug 4 01:57:13 CEST 2013
On 03/08/2013 21:59, Alexander Belopolsky wrote:
>
> On Sat, Aug 3, 2013 at 10:35 AM, Stephen J. Turnbull
> <stephen at xemacs.org> wrote:
>
> The problem is that someone will use code written by someone using a
> future version and run it with a past version, and the assert will
> trigger. I don't see any good reason why it should. The Unicode
> Standard explicitly specifies how unknown code points should be
> handled. Raising an exception is not part of that spec.
>
>
> It looks like we are running into a confusion between code points and
> their name property. I agree that a conforming function that returns
> the name for a code point should not raise an exception. The standard
> offers two alternatives: return an empty string or return a generated
> label. In this sense unicodedata.name() is
> not conforming:
>
> >>> unicodedata.name('\u0009')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: no such name
>
> However, it is trivial to achieve conforming behavior:
>
> >>> unicodedata.name('\u0009', '')
> ''
>
> I propose adding a unicode.label() function that will return
> 'control-0009' in this case. I think this proposal is fully
> in line with the standard.
>
> For the inverse operation, unicodedata.lookup(), I don't see anything in
> the standard that precludes raising an exception on an unknown name. If
> that was a problem, we would have it already.
>
> In Python >=3.2:
>
> >>> unicodedata.lookup('BELL')
> '🔔'
>
> But in Python 3.1:
>
> >>> unicodedata.lookup('BELL')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> KeyError: "undefined character name 'BELL'"
>
> The only potential problem that I see with my proposal is that it is
> reasonable to expect that if '\N{whatever}' works in one version it will
> work the same in all versions after that. My proposal will break this
> expectation only in the case of '\N{reserved-NNNN}'. Once a code point
> NNNN is assigned '\N{reserved-NNNN}' will become a syntax error.
>
I think that's to be expected. When a codepoint is assigned, it's no
longer "reserved". It's just unfortunate that it'll break the code.
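
For concreteness, here is a rough sketch of what such a label function
might do. It is only an illustration, not a proposed implementation: the
name label() and the fallback to an empty string are my assumptions, and
the label prefixes follow my reading of the code point label types in
the Unicode Standard (control, reserved, noncharacter, private-use,
surrogate).

```python
import unicodedata

def label(ch):
    """Return the name of ch, or a generated code point label.

    Hypothetical helper; not part of the unicodedata module.
    """
    name = unicodedata.name(ch, '')
    if name:
        return name
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat == 'Cc':
        kind = 'control'
    elif cat == 'Co':
        kind = 'private-use'
    elif cat == 'Cs':
        kind = 'surrogate'
    elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        # Noncharacters are also category Cn, so test them first.
        kind = 'noncharacter'
    elif cat == 'Cn':
        kind = 'reserved'
    else:
        # Assigned, but this database has no name for it.
        return ''
    return '%s-%04X' % (kind, cp)
```

With this, label('\u0009') gives 'control-0009', and which code points
come back as 'reserved-NNNN' depends on the Unicode version the
interpreter was built with.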
> If you agree that this is the only problematic case, let's focus on it.
> I cannot think of any reason to deliberately use reserved characters
> other than to stress-test your Unicode-handling software. In this
> application, you probably want to see an error once NNNN is assigned
> because your tests will no longer cover the unassigned character case.
>
Actually putting '\N{reserved-NNNN}' in your code would be a bad idea
because at some point in the future you won't be able to run the code
at all!
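
If a test really needs an unassigned code point, one option is to pick
it at run time instead of spelling it in a literal, so the source keeps
compiling whatever the Unicode version. A minimal sketch, assuming
Python 3; the helper names is_reserved() and first_reserved() are made
up for illustration:

```python
import unicodedata

def is_reserved(cp):
    """True if cp is unassigned (Cn) and not a noncharacter."""
    noncharacter = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return unicodedata.category(chr(cp)) == 'Cn' and not noncharacter

def first_reserved(start=0):
    """Return the first reserved code point at or after start, or None."""
    for cp in range(start, 0x110000):
        if is_reserved(cp):
            return cp
    return None
```

The test then uses chr(first_reserved()) and keeps covering the
unassigned case after any particular code point gets assigned.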
> Can you suggest any other use?
>