Re: [Python-ideas] Support Unicode code point labels (Was: notation)
I am starting a new thread to discuss an idea that is orthogonal to Steven D'Aprano's \U+NNNN proposal.

The Unicode Standard defines five types of code points for which it does not provide a unique Name property. These types are: Control, Reserved, Noncharacter, Private-use and Surrogate. When a unique descriptive label is required for any such code point, the standard recommends constructing a label as follows:

"For each code point type without character names, code point labels are constructed by using a lowercase prefix derived from the code point type, followed by a hyphen-minus and then a 4- to 6-digit hexadecimal representation of the code point."

I propose adding support for these labels to unicodedata.lookup(), \N{..} and unicodedata.name() (or unicodedata.label()).

In the previous thread, there was a disagreement on whether invalid labels (such as reserved-0009 instead of control-0009) should be accepted. I will address this in my response to Stephen Turnbull's e-mail below.

Another question is how to add support for generating the labels in a backward compatible manner. Currently unicodedata.name() raises ValueError when no name is available for a code point:
>>> unicodedata.name(chr(0x0009))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
Since unicodedata.name() also supports specifying a default, it is unlikely
that users write code like this
try:
    name = unicodedata.name(x)
except ValueError:
    name = 'U+{:04X}'.format(ord(x))
instead of
name = unicodedata.name(x, '') or 'U+{:04X}'.format(ord(x))
However, the most conservative approach is not to change the behavior of
unicodedata.name() and provide a new function unicodedata.label().
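To make the idea concrete, here is a minimal sketch of what a separate label function could look like. Nothing like this exists in unicodedata today; label() and code_point_type() are hypothetical names, and falling back to the regular name for named characters is my assumption rather than something the standard prescribes. The noncharacter test (U+FDD0..U+FDEF plus the last two code points of every plane) follows the standard's definition.

```python
import unicodedata

# The 32 noncharacters in the Arabic Presentation Forms-A block.
_NONCHAR_RANGE = range(0xFDD0, 0xFDF0)


def code_point_type(cp):
    """Return the label prefix for code point cp, or None if the code
    point is expected to carry a regular character name."""
    category = unicodedata.category(chr(cp))
    if category == 'Cc':
        return 'control'
    if category == 'Co':
        return 'private-use'
    if category == 'Cs':
        return 'surrogate'
    if category == 'Cn':
        # Noncharacters share General_Category Cn with reserved code
        # points: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane.
        if cp in _NONCHAR_RANGE or (cp & 0xFFFE) == 0xFFFE:
            return 'noncharacter'
        return 'reserved'
    return None


def label(ch):
    """Hypothetical stand-in for the proposed unicodedata.label()."""
    cp = ord(ch)
    prefix = code_point_type(cp)
    if prefix is None:
        # Assumption: named characters simply report their name.
        return unicodedata.name(ch)
    return '{}-{:04X}'.format(prefix, cp)
```

With this sketch, label('\t') gives 'control-0009' while unicodedata.name() keeps its current behavior, so no existing code is affected.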
On Fri, Aug 2, 2013 at 3:36 AM, Stephen J. Turnbull wrote:
Alexander Belopolsky writes:
.. why would you write \N{reserved-NNNN} instead of \uNNNN to begin with?
I wouldn't. The problem isn't writing "\N{reserved-50000}". It's the other way around: I want to *write* "\N{control-50000}" which expresses my intent in Python 3.5 and not have it blow up in Python 3.4 which uses an older UCD where U+50000 is unassigned.
"\N{control-50000}" will blow up in every past, present or future Python version. Since Unicode 1.1.5, "The General_Category property value Control (Cc) is immutable: the set of code points with that value will never change." http://www.unicode.org/policies/stability_policy.html
With the possible exception of reserved-, on a rare occasion when you want to be explicit about the character type, it is useful to be strict.
As explained above, strictness is not backward compatible with older versions of the UCD that might be in use in older versions of Python.
This is not an issue for versions of Python that currently exist because they do not support \N{<type-prefix>-NNNN} syntax at all. What may happen if my proposal is accepted is that \N{reserved-50000} will be valid in Python 3.N but invalid in 3.N+1 for some N > 3. If this becomes an issue, we can solve this problem when the time comes. It is always easier to relax the rules than to make them stricter.

Yet, I still don't see the problem. You can already write

assert unicodedata.category(chr(0x50000)) == 'Cn'

in your code and this will blow up in any future version that will use UCD with U+50000 assigned. You can think of "\N{<type-prefix>-NNNN}" as syntactic sugar for "\uNNNN" followed by an assert.
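The "escape followed by an assert" reading can be sketched as a strict lookup. This is only an illustration of the semantics I have in mind: strict_lookup() and _type_of() are hypothetical names, accepting only uppercase hex digits is an assumption, and the real implementation would live in the \N{...} parser and unicodedata.lookup().

```python
import re
import unicodedata

_LABEL_RE = re.compile(
    r'(control|reserved|noncharacter|private-use|surrogate)'
    r'-([0-9A-F]{4,6})\Z')


def _type_of(cp):
    """Return the label prefix for cp, or None for named characters."""
    category = unicodedata.category(chr(cp))
    if category == 'Cn':
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        return 'noncharacter' if is_nonchar else 'reserved'
    return {'Cc': 'control', 'Co': 'private-use',
            'Cs': 'surrogate'}.get(category)


def strict_lookup(name):
    """Resolve a character name or a code point label.

    A label is accepted only if its prefix agrees with the type the
    current UCD assigns to the code point (the "assert" part).
    """
    match = _LABEL_RE.match(name)
    if match is None:
        return unicodedata.lookup(name)  # ordinary character name
    prefix, digits = match.groups()
    cp = int(digits, 16)
    if cp > 0x10FFFF or _type_of(cp) != prefix:
        raise KeyError('undefined character name {!r}'.format(name))
    return chr(cp)
```

Under this sketch strict_lookup('control-0009') returns the tab character, and strict_lookup('reserved-0009') raises KeyError in every version, because U+0009 is and will remain a control.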
Alexander Belopolsky writes:
Yet, I still don't see the problem.
You can already write assert unicodedata.category(chr(0x50000)) == 'Cn' in your code and this will blow up in any future version that will use UCD with U+50000 assigned.
That's not a problem. As you say, "presumably you're doing that for good reason." The problem is that someone will use code written by someone using a future version and run it with a past version, and the assert will trigger. I don't see any good reason why it should. The Unicode Standard explicitly specifies how unknown code points should be handled. Raising an exception is not part of that spec.
On Sat, Aug 3, 2013 at 10:35 AM, Stephen J. Turnbull wrote:
The problem is that someone will use code written by someone using a future version and run it with a past version, and the assert will trigger. I don't see any good reason why it should. The Unicode Standard explicitly specifies how unknown code points should be handled. Raising an exception is not part of that spec.
It looks like we are running into a confusion between code points and their name property. I agree that a conforming function that returns the name for a code point should not raise an exception. The standard offers two alternatives: return an empty string or return a generated label. In this sense unicodedata.name() is not conforming:
>>> unicodedata.name('\u0009')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
However, it is trivial to achieve conforming behavior:
>>> unicodedata.name('\u0009', '')
''
I propose adding a unicodedata.label() function that will return 'control-0009' in this case. I think this proposal is fully in line with the standard.

For the inverse operation, unicodedata.lookup(), I don't see anything in the standard that precludes raising an exception on an unknown name. If that was a problem, we would have it already. In Python >=3.2:
>>> unicodedata.lookup('BELL')
'🔔'
But in Python 3.1:
>>> unicodedata.lookup('BELL')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'BELL'"
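Code that has to run against more than one UCD version can already guard against this with the existing API. A small sketch follows; lookup_or_default() is a hypothetical helper name, while unicodedata.lookup() and unicodedata.unidata_version are the real module members.

```python
import unicodedata


def lookup_or_default(name, default=None):
    """Return the character called name, or default when the UCD
    bundled with this interpreter does not know that name."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default


# unidata_version tells which UCD version this interpreter ships with.
print(unicodedata.unidata_version, repr(lookup_or_default('BELL')))
```

The same pattern would apply to labels if lookup() learned to accept them: a label that is not valid for the running UCD would simply fall through to the default.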
The only potential problem that I see with my proposal is that it is reasonable to expect that if '\N{whatever}' works in one version it will work the same in all versions after that. My proposal will break this expectation only in the case of '\N{reserved-NNNN}'. Once a code point NNNN is assigned '\N{reserved-NNNN}' will become a syntax error. If you agree that this is the only problematic case, let's focus on it. I cannot think of any reason to deliberately use reserved characters other than to stress-test your unicode handling software. In this application, you probably want to see an error once NNNN is assigned because your tests will no longer cover the unassigned character case. Can you suggest any other use?
On 03/08/2013 21:59, Alexander Belopolsky wrote:
On Sat, Aug 3, 2013 at 10:35 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
The problem is that someone will use code written by someone using a future version and run it with a past version, and the assert will trigger. I don't see any good reason why it should. The Unicode Standard explicitly specifies how unknown code points should be handled. Raising an exception is not part of that spec.
It looks like we are running into a confusion between code points and their name property. I agree that a conforming function that returns the name for a code point should not raise an exception. The standard offers two alternatives: return an empty string or return a generated label. In this sense unicodedata.name() is not conforming:
>>> unicodedata.name('\u0009')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
However, it is trivial to achieve conforming behavior:
>>> unicodedata.name('\u0009', '')
''
I propose adding a unicodedata.label() function that will return 'control-0009' in this case. I think this proposal is fully in line with the standard.
For the inverse operation, unicodedata.lookup(), I don't see anything in the standard that precludes raising an exception on an unknown name. If that was a problem, we would have it already.
In Python >=3.2:
>>> unicodedata.lookup('BELL')
'🔔'
But in Python 3.1:
>>> unicodedata.lookup('BELL')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'BELL'"
The only potential problem that I see with my proposal is that it is reasonable to expect that if '\N{whatever}' works in one version it will work the same in all versions after that. My proposal will break this expectation only in the case of '\N{reserved-NNNN}'. Once a code point NNNN is assigned '\N{reserved-NNNN}' will become a syntax error.
I think that's to be expected. When a codepoint is assigned, it's no longer "reserved". It's just unfortunate that it'll break the code.
If you agree that this is the only problematic case, let's focus on it. I cannot think of any reason to deliberately use reserved characters other than to stress-test your unicode handling software. In this application, you probably want to see an error once NNNN is assigned because your tests will no longer cover the unassigned character case.
Actually putting '\N{reserved-NNNN}' in your code would be a bad idea because at some point in the future you won't be able to run the code at all!
Can you suggest any other use?
Alexander Belopolsky writes:
It looks like we are running into a confusion between code points and their name property.
No. There is no confusion, not on my part, anyway. Let me state my position in full, without rationale.

The name property is an entry in the UCD, indexed by code point. unicodedata.name(codepoint) should return that property, and nothing else (perhaps returning an empty string or placeholder string that cannot be a character name, instead of an exception, for a codepoint that doesn't have a name property). The code_point_type-nnnn construct, a label, should never be returned by unicodedata.name(). I have no position on whether a new method such as .label() should be added to support deriving labels from code points. (I suspect it's a YAGNI but I have no good evidence for that.)

The question is how to handle strings that purport to uniquely describe some Unicode code point.

First, since the code point is described uniquely, I see no need (except perhaps backward compatibility) for a new method. If the backward incompatibility is judged small (I think it is), then the .lookup() method should be extended to handle strings that are not the name property of any character.

Second, if the string argument is the name property of a Unicode character, that character's code point should be returned. I think these first two points are non-controversial (modulo one's opinion on the backward compatibility issue).

Third, I contend that unicodedata.lookup() should recognize the "U+nnnn" format and return int("nnnn", 16). Further, use of "\N{U+nnnn}" is preferable to Steven's proposed "\U+nnnn" escape sequence, or variants using braces to delimit the code point.

Fourth, I find it acceptable that unicodedata.lookup() should recognize the "code_point_type-nnnn" label format for any code_point_type defined in Unicode, and return int("nnnn", 16). (Again, personally I think it's a YAGNI, but others might find the redundant code point type information useful for consistency checking.) Further, unicodedata.lookup() should not raise an exception if code_point_type is inconsistent with nnnn.
This consistency checking should be left up to programs like pylint.
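For concreteness, the lenient behavior described in the third and fourth points might look like the sketch below. lenient_lookup() is a hypothetical name, not an existing function. As an assumption, it returns the character rather than the integer so that it matches what unicodedata.lookup() returns today, and it accepts hex digits in either case.

```python
import re
import unicodedata

# Matches "U+nnnn" or "<code_point_type>-nnnn"; only the hex digits
# are captured, so the type prefix is deliberately not checked.
_CODE_POINT_RE = re.compile(
    r'(?:U\+|(?:control|reserved|noncharacter|private-use|surrogate)-)'
    r'([0-9A-Fa-f]{4,6})\Z')


def lenient_lookup(name):
    """Resolve a character name, "U+nnnn", or a code point label."""
    match = _CODE_POINT_RE.match(name)
    if match is None:
        return unicodedata.lookup(name)  # ordinary character name
    cp = int(match.group(1), 16)
    if cp > 0x10FFFF:
        raise KeyError('undefined character name {!r}'.format(name))
    return chr(cp)
```

Here lenient_lookup('reserved-0009') and lenient_lookup('control-0009') both return the tab character, and the result does not depend on which UCD version the interpreter ships with.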
On 4 August 2013 18:16, Stephen J. Turnbull wrote:
Alexander Belopolsky writes:
It looks like we are running into a confusion between code points and their name property.
No. There is no confusion, not on my part, anyway. Let me state my position in full, without rationale.
And just for the record: my position is consistent with Stephen's, including the "You Ain't Gonna Need It" call for the "code_point_type-nnnn" format.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (4)
- Alexander Belopolsky
- MRAB
- Nick Coghlan
- Stephen J. Turnbull