[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

MRAB python at mrabarnett.plus.com
Thu Jul 12 13:09:50 EDT 2018


On 2018-07-12 16:02, Stephen J. Turnbull wrote:
> Robert Vanden Eynde writes:
> 
>   > As I'm at it, I mentioned the ffef character, but we don't care
>   > about it because it already has a name, so that's mostly a control
>   > character issue.
> 
> The problem with control characters is that from the point of view of
> the Unicode Standard, the C0 and C1 registers are basically a space
> reserved for private use (see ISO 6429 for the huge collection of
> standardized control functions).  That is, unlike the rest of the
> Unicode repertoire, the "characters" mapped there are neither unique
> nor context-independent.  It's true that ISO 6429 recommends specific
> C0 and C1 sets (but the recommended C1 set isn't even complete:
> U+0080, U+0081, and U+0099 aren't assigned!).  However, Unicode only
> suggests that those should be the default interpretations, because the
> useful control functions are going to be dependent on context (e.g.,
> input and output devices).
> 
> This is like the situation with Internet addresses and domain names.
> The mapping is inherently many-many; round-tripping is not possible.
> 
> And in fact there are a number of graphic characters that have
> multiple code points due to bugs in national character sets.  So for
> graphic characters, it's possible to ensure name(code(x)) = x, but
> it's not possible to ensure code(name(x)) = x, except under special
> circumstances (which apply to the vast majority of characters, of
> course).
> 
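
As a concrete baseline, here is what the stdlib does today for a C0 control
such as ESC (U+001B) on Python 3.3 and later -- lookup() already accepts the
formal name aliases, even though name() has nothing to return:

    import unicodedata

    # Control characters have no Unicode Name property, so name() raises
    # unless a default is supplied:
    try:
        unicodedata.name('\x1b')
    except ValueError:
        pass                                        # "no such name"
    print(unicodedata.name('\x1b', 'U+001B'))       # -> U+001B

    # lookup() accepts the formal name aliases (since Python 3.3):
    print(unicodedata.lookup('ESCAPE') == '\x1b')   # -> True
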
>   > I like your alias(...) function; with that one, an application
>   > could code my function like: try name(x) except
>   > alias(x).abbreviations[0], if the abbreviation list is sorted by
>   > AdditionToUnicodeDate.
> 
> I don't understand why that's particularly useful, especially in the
> Han case (see below).
> 
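
Spelled out, the pattern Robert describes is roughly the sketch below;
alias() is hypothetical (unicodedata exposes no alias list today), assumed
here to return an object whose .abbreviations are sorted by date of
addition to Unicode:

    import unicodedata

    def display_name(char, alias):
        # alias() is a hypothetical helper, not an existing API.
        try:
            return unicodedata.name(char)
        except ValueError:
            return alias(char).abbreviations[0]
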
>   > However, having a standard canonical name for all characters in the
>   > stdlib would help people choose the same convention. A new
>   > function like "canonical_name" or a shorter name would be an idea.
> 
> I don't understand what you're asking for.  The Unicode Standard
> already provides canonical names.  Of course, the canonical names of
> most Han ideographs (near and dear to my heart) are pretty useless
> (they look like "CJK UNIFIED IDEOGRAPH-4E00").  (You probably don't
> want to get the Japanese, Chinese---and there are a lot of different
> kinds of Chinese---and Koreans started on what the "canonical" name
> should be.  One Han Unification controversy is enough for this
> geological epoch!)  This is closely related to the Unicode standard's
> generic recommendation (Ch. 4.8):
> 
>      On the other hand, an API which returns a name for Unicode code
>      points, but which is expected to provide useful, unique labels for
>      unassigned, reserved code points and other special code point
>      types, should return the value of the Unicode Name property for
>      any code point for which it is non-null, but should otherwise
>      construct a code point label to stand in for a character name.
> 
> (I suppose "should" here is used in the sense of RFC 2119.)  So, the
> standard defines a canonical naming scheme, although many character
> names are not terribly mnemonic even to native speakers.
> 
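
One way to follow that recommendation with today's unicodedata would be a
sketch along these lines (the label kinds follow the conventions in TUS 4.8;
name_or_label() is illustrative, not an existing API):

    import unicodedata

    def name_or_label(char):
        # Return the Unicode Name property when it is non-null, otherwise
        # construct a code point label as TUS 4.8 recommends.
        name = unicodedata.name(char, None)
        if name is not None:
            return name
        cp = ord(char)
        kind = {'Cc': 'control',
                'Co': 'private-use',
                'Cs': 'surrogate'}.get(unicodedata.category(char))
        if kind is None:
            # General category Cn covers noncharacters and reserved code points.
            if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                kind = 'noncharacter'
            else:
                kind = 'reserved'
        return '<%s-%04X>' % (kind, cp)

    print(name_or_label('\x1b'))      # -> <control-001B>
    print(name_or_label('\ufdd0'))    # -> <noncharacter-FDD0>
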
> On the other hand, if you want useful aliases for Han characters, for
> many of them there could be scores of aliases, based on pronunciation,
> semantics, and appearance, the first two of which vary
> substantially within a single language, let alone across languages.
> Worse, as far as I know there are no standard equivalent ways to
> express these things in English, as when writing about these
> characters in English you often adopt a romanized version of the
> technical terms in the language you're studying.  And, it's a minor
> point, but there are new Han characters discovered every day (I'm not
> even sure that's an exaggeration), as scholars examine regional and
> historical documents.
> 
> So for this to be most useful to me, I would want it developed OUTSIDE
> of the stdlib, with releases even more frequent than pytz (that is an
> exaggeration).  Not so much because I'll frequently need anything
> outside of the main CJK block in Plane 0, but because the complexity
> of character naming in East Asia suggests that improvements in
> heuristics for assigning priority to aliases, language-specific
> variations in heuristics, and so on will be rapid for the foreseeable
> future.  It would be a shame to shackle that to the current stdlib
> release cycle even if it doesn't need to be as frenetic as pytz.  This
> goes in spades for people who are waiting for their own scripts to be
> standardized.
> 
> For the stdlib, I'm -1 on anything other than the canonical names plus
> the primary aliases for characters which are well-defined in the code
> charts of the Unicode Standard, such as those for the C0 and (most of)
> the C1 control characters.  And even there I think a canonical name
> based on block name + code point in hex is the best way to go.
> 
> I think this problem is a lot harder than many of the folk
> participating in this discussion so far realize.
> 
AFAIR, the last time codepoint names were talked about, I suggested that 
there could be a fallback to U+XXXX. unicodedata.name can accept a 
default, but unicodedata.lookup doesn't accept such a 'name'.
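
A minimal sketch of that round trip, assuming the fallback is spelled
'U+XXXX' (the wrappers below are illustrative, not stdlib functions):

    import re
    import unicodedata

    _LABEL = re.compile(r'^U\+([0-9A-Fa-f]{4,6})$')

    def name(char):
        # unicodedata.name() already accepts a default for unnamed code points.
        return unicodedata.name(char, 'U+%04X' % ord(char))

    def lookup(name_or_label):
        # unicodedata.lookup() raises KeyError on 'U+XXXX', so handle the
        # fallback label before delegating.
        m = _LABEL.match(name_or_label)
        if m:
            return chr(int(m.group(1), 16))
        return unicodedata.lookup(name_or_label)

    assert lookup(name('\x00')) == '\x00'   # round-trips via the fallback
    assert lookup(name('A')) == 'A'         # ordinary names still work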

