[Python-ideas] Support Unicode code point labels (Was: notation)

Sun Aug 4 10:16:22 CEST 2013

Alexander Belopolsky writes:

 > It looks like we are running into a confusion between code points
 > and their name property.

No.  There is no confusion, not on my part, anyway.  Let me state my
position in full, without rationale.

The name property is an entry in the UCD, indexed by code point.
unidata.name(codepoint) should return that property, and nothing else
(perhaps returning an empty string or placeholder string that cannot
be a character name, instead of an exception, for a codepoint that
doesn't have a name property).
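
A minimal sketch of the behaviour described above, assuming "unidata" is shorthand for the standard library's unicodedata module. The helper name name_or_empty is hypothetical; it relies only on the existing default argument of unicodedata.name().

```python
import unicodedata


def name_or_empty(code_point):
    """Return the UCD name property for code_point, or "" if it has none.

    Hypothetical helper: it never returns a code point label such as
    "control-0009", only the name property or the empty string.
    """
    # unicodedata.name() takes a one-character string, so convert first.
    # The second argument is returned instead of raising ValueError.
    return unicodedata.name(chr(code_point), "")


print(repr(name_or_empty(0x41)))  # a letter, which has a name property
print(repr(name_or_empty(0x00)))  # a control code, which has none
```

The empty string is used here because it cannot be mistaken for a character name.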

The code_point_type-nnnn construct, a label, should never be returned
by unidata.name().  I have no position on whether a new method such as
.label() should be added to support deriving labels from code points.
(I suspect it's a YAGNI but I have no good evidence for that.)
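
For illustration only, here is one way such a method could be sketched, assuming the five label types of the Unicode Standard (control, reserved, noncharacter, private-use, surrogate) and deriving the type from the general category. The function name label and the choice to return None for code points outside those types are assumptions, not part of any existing API.

```python
import unicodedata


def label(code_point):
    """Return a code_point_type-nnnn label, or None for other code points.

    Hypothetical sketch: the label type is derived from the general
    category reported by unicodedata.category().
    """
    category = unicodedata.category(chr(code_point))
    if category == "Cc":
        kind = "control"
    elif category == "Co":
        kind = "private-use"
    elif category == "Cs":
        kind = "surrogate"
    elif category == "Cn":
        # Noncharacters are U+FDD0..U+FDEF plus the last two code points
        # of every plane (U+xxFFFE and U+xxFFFF); the rest are reserved.
        is_nonchar = (0xFDD0 <= code_point <= 0xFDEF
                      or (code_point & 0xFFFE) == 0xFFFE)
        kind = "noncharacter" if is_nonchar else "reserved"
    else:
        return None
    return "%s-%04X" % (kind, code_point)
```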

The question is how to handle strings that purport to uniquely
describe some Unicode code point.

First, since the code point is described uniquely, I see no need
(except perhaps backward compatibility) for a new method.  If the
backward incompatibility is judged small (I think it is), then the
.lookup() method should be extended to handle strings that are not the
name property of any character.

Second, if the string argument is the name property of a Unicode
character, that character's code point should be returned.  I think
these first two points are non-controversial (modulo one's opinion on
the backward compatibility issue).

Third, I contend that unidata.lookup() should recognize the "U+nnnn"
format and return int("nnnn", 16).  Further, use of "\N{U+nnnn}" is
preferable to Steven's proposed "\U+nnnn" escape sequence, or variants
using braces to delimit the code point.
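
The second and third points can be sketched together as a wrapper around the existing unicodedata.lookup(). The name lookup_code_point is hypothetical, as is the choice to accept four to six hex digits. Note that the real unicodedata.lookup() returns a character rather than an integer, so the sketch converts with ord() to match the "return the code point" wording above.

```python
import re
import unicodedata

# "U+" followed by 4 to 6 hex digits (an assumption about the accepted form).
_U_PLUS = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def lookup_code_point(text):
    """Return the code point described by a character name or "U+nnnn"."""
    match = _U_PLUS.fullmatch(text)
    if match:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise KeyError("code point out of range: %s" % text)
        return code_point
    # Otherwise treat the string as a name property; unicodedata.lookup()
    # raises KeyError for unknown names. Named sequences, which map to
    # more than one code point, are not handled by this sketch.
    return ord(unicodedata.lookup(text))
```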

Fourth, I find it acceptable that unidata.lookup() should recognize
the "code_point_type-nnnn" label format for any code_point_type
defined in Unicode, and return int("nnnn", 16).  (Again, personally I
think it's a YAGNI, but others might find the redundant code point
type information useful for consistency checking.)  Further,
unidata.lookup() should not raise an exception if code_point_type is
inconsistent with nnnn.  This consistency checking should be left up
to programs like pylint.
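
A sketch of the fourth point, assuming the five label types of the Unicode Standard. The function name label_to_code_point is hypothetical. As proposed above, the type prefix is accepted but deliberately not checked against the hex digits.

```python
import re

# Label types assumed from the Unicode Standard's code point labels.
_LABEL_TYPES = ("control", "reserved", "noncharacter",
                "private-use", "surrogate")
_LABEL = re.compile(r"(%s)-([0-9A-Fa-f]{4,6})" % "|".join(_LABEL_TYPES))


def label_to_code_point(label):
    """Return int("nnnn", 16) for a code_point_type-nnnn label.

    No exception is raised when code_point_type is inconsistent with
    nnnn; that check is left to external tools.
    """
    match = _LABEL.fullmatch(label)
    if match is None:
        raise KeyError(label)
    return int(match.group(2), 16)


print(hex(label_to_code_point("control-0009")))
print(hex(label_to_code_point("surrogate-0041")))  # inconsistent, still accepted
```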