[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Thu Jul 12 11:02:20 EDT 2018


Robert Vanden Eynde writes:

 > As I'm at, I mentioned the ffef character but we don't care about
 > it because it already has a name, so that's mostly a control
 > character issue.

The problem with control characters is that from the point of view of
the Unicode Standard, the C0 and C1 registers are basically a space
reserved for private use (see ISO 6429 for the huge collection of
standardized control functions).  That is, unlike the rest of the
Unicode repertoire, the "characters" mapped there are neither unique
nor context-independent.  It's true that ISO 6429 recommends specific
C0 and C1 sets (but the recommended C1 set isn't even complete:
U+0080, U+0081, and U+0099 aren't assigned!).  However, Unicode only
suggests that those should be the default interpretations, because the
useful control functions are going to be dependent on context (eg,
input and output devices).

This is like the situation with Internet addresses and domain names.
The mapping is inherently many-to-many; round-tripping is not possible.

And in fact there are a number of graphic characters that have
multiple code points due to bugs in national character sets.  So for
graphic characters, it's possible to ensure name(code(x)) = x, but
it's not possible to ensure code(name(x)) = x, except under special
circumstances (which apply to the vast majority of characters, of
course).
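As a rough illustration of the round-tripping question in Python
terms, here is a minimal sketch using the existing unicodedata module,
with lookup() standing in for code().  It assumes Python 3.3 or later,
where lookup() also resolves the formal aliases from NameAliases.txt.
The particular character was picked only because it has a well-known
corrected name.

```python
import unicodedata

# A canonical name round-trips through lookup() and name().
canonical = "LATIN CAPITAL LETTER OI"
ch = unicodedata.lookup(canonical)
assert unicodedata.name(ch) == canonical

# A formal alias resolves to the same code point, but name() returns
# the canonical name, so the alias itself does not come back.
alias = "LATIN CAPITAL LETTER GHA"
assert unicodedata.lookup(alias) == ch
assert unicodedata.name(ch) != alias

# Control characters have no Name property at all: name() raises
# ValueError unless a default is supplied.
try:
    unicodedata.name("\n")
    has_name = True
except ValueError:
    has_name = False
fallback = unicodedata.name("\n", "<no name>")
```

So the round trip holds only when you start from the canonical name,
and for the controls there is no canonical name to start from.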

 > I like your alias(...) function, with that one, an application
 > could code my function like try name(x) except
 > alias(x).abbreviations[0]. If the abbreviation list is sorted by
 > AdditionToUnicodeDate.

I don't understand why that's particularly useful, especially in the
Han case (see below).

 > However, having a standard canonical name for all characters in the
 > stdlib would help people choosing the same convention. A new
 > function like "canonical_name" or a shorter name would be an idea.

I don't understand what you're asking for.  The Unicode Standard
already provides canonical names.  Of course, the canonical names of
most Han ideographs (near and dear to my heart) are pretty useless
(they look like "CJK UNIFIED IDEOGRAPH-4E00").  (You probably don't
want to get the Japanese, Chinese---and there are a lot of different
kinds of Chinese---and Koreans started on what the "canonical" name
should be.  One Han Unification controversy is enough for this
geological epoch!)  This is closely related to the Unicode standard's
generic recommendation (Ch. 4.8):

    On the other hand, an API which returns a name for Unicode code
    points, but which is expected to provide useful, unique labels for
    unassigned, reserved code points and other special code point
    types, should return the value of the Unicode Name property for
    any code point for which it is non-null, but should otherwise
    construct a code point label to stand in for a character name.

(I suppose "should" here is used in the sense of RFC 2119.)  So, the
standard defines a canonical naming scheme, although many character
names are not terribly mnemonic even to native speakers.
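A minimal sketch of what that recommendation might look like on top of
unicodedata follows.  The function name name_or_label is hypothetical
(nothing like it exists in the stdlib).  The label kinds follow the
code point label convention described in the Standard (control,
private-use, surrogate, noncharacter, reserved).  It simply treats
anything else that lacks a name in the running Python's copy of the
Unicode database as "reserved".

```python
import unicodedata


def name_or_label(ch):
    """Return the Unicode Name of ch, or a code point label if it has none."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        pass  # no Name property: fall through and build a label

    cp = ord(ch)
    category = unicodedata.category(ch)
    if category == "Cc":
        kind = "control"
    elif category == "Co":
        kind = "private-use"
    elif category == "Cs":
        kind = "surrogate"
    elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        # U+FDD0..U+FDEF plus the last two code points of every plane.
        kind = "noncharacter"
    else:
        # Assumption: whatever is left is unassigned in this UCD version.
        kind = "reserved"
    return "<%s-%04X>" % (kind, cp)
```

With this, name_or_label("\n") gives "<control-000A>", while a Han
ideograph such as U+4E00 still gets its (not very helpful) canonical
name.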

On the other hand, if you want useful aliases for Han characters, for
many of them there could be scores of aliases, based on pronunciation,
semantics, and appearance, the first two of which vary
substantially within a single language, let alone across languages.
Worse, as far as I know there are no standard equivalent ways to
express these things in English, as when writing about these
characters in English you often adopt a romanized version of the
technical terms in the language you're studying.  And, it's a minor
point, but there are new Han characters discovered every day (I'm not
even sure that's an exaggeration), as scholars examine regional and
historical documents.

So for this to be most useful to me, I would want it developed OUTSIDE
of the stdlib, with releases even more frequent than pytz (that is an
exaggeration).  Not so much because I'll frequently need anything
outside of the main CJK block in Plane 0, but because the complexity
of character naming in East Asia suggests that improvements in
heuristics for assigning priority to aliases, language-specific
variations in heuristics, and so on will be rapid for the foreseeable
future.  It would be a shame to shackle that to the current stdlib
release cycle even if it doesn't need to be as frenetic as pytz.  This
goes in spades for people who are waiting for their own scripts to be
standardized.

For the stdlib, I'm -1 on anything other than the canonical names plus
the primary aliases for characters which are well-defined in the code
charts of the Unicode Standard, such as those for the C0 and (most of)
the C1 control characters.  And even there I think a canonical name
based on block name + code point in hex is the best way to go.
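To make "block name + code point in hex" concrete, here is one
possible sketch.  Everything in it is an assumption for illustration:
unicodedata has no block lookup, so the table is a three-row excerpt
hand-copied from the Unicode Blocks.txt ranges, and the output format
(uppercased block name, hyphen, hex) is just one plausible spelling,
modelled on the existing "CJK UNIFIED IDEOGRAPH-4E00" style.

```python
import bisect

# Hypothetical excerpt of Blocks.txt: (first, last, block name).
# A real implementation would load the full file.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_label(ch):
    """Build a name from the block name and the code point in hex."""
    cp = ord(ch)
    # Find the last block whose start is <= cp, then check its end.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return "%s-%04X" % (BLOCKS[i][2].upper(), cp)
    # Not covered by this excerpt: fall back to plain U+XXXX notation.
    return "U+%04X" % cp
```

For example, block_label("\x85") yields "LATIN-1 SUPPLEMENT-0085",
which is unambiguous regardless of which control function a given
device assigns to that code point.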

I think this problem is a lot harder than many of the folk
participating in this discussion so far realize.

Steve


