[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

Mon Jul 16 01:27:57 EDT 2018

Steven D'Aprano writes:

 > Sorry, I'm not sure if you mean my proposed alias() function isn't
 > useful, or Robert's try...except loop around it.

I was questioning the utility of "If the abbreviation list is sorted
by AdditionToUnicodeDate."

But since you ask, neither function is useful TO ME, as I understand
them, because they're based on the UCD NameAliases.txt.  That doesn't
have any aliases I would actually use.  I've never needed aliases for
control characters, and for everything else the canonical name is
perfectly useful (including for Korean characters and Japanese kana,
which have phonetic names, as do Chinese bopomofo AIUI).  There's
nothing useful for Han characters yet, sadly.

 > My alias() function is just a programmatic interface to information 
 > already available in the NameAliases.txt file. Don't you think that's 
 > useful enough as it stands?

To be perfectly frank, if that's all it is, I don't know when I'd ever
use it.  Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL
1" tells me a lot less about that control character than "U+0011"
does.  Other aliases in that file are just wrong: I don't believe I've
ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded
character entity.  That's the DOS "END OF FILE".  Certainly, the
aliases of category "correction" are useful, though not to me---I
don't read any of the relevant languages.  The "figment" category is
stupid; almost all the names of control characters are figments,
except for the half-dozen well-known whitespace characters, NUL, and
maybe DEL.  The 256 VSxxx "variation selectors" are somewhat useful,
but I would think that it would be even more useful to provide skin
color aliases for face emoji and X11 RGB.txt color aliases for hearts
and the like, which presumably are standardized across vendors.

If I were designing a feature for the stdlib, I would

0.  Allow the database to consist of multiple alias tables, and be
    extensible by adding tables via user configuration.
1.  Make the priority of the alias tables user-configurable.
2.  Provide a default top-priority table better suited to likely
    Python usage than NameAliases.txt.
3.  Provide both a primary alias function, and a list of all aliases
    function.
4.  Provide a reverse lookup function.
5.  Perhaps provide a context-sensitive alias function.  The only
    context I can think of offhand is "position in file", ie, to
    distinguish between ZWNBSP and BOM, so perhaps that's not worth
    doing.  On the other hand, given that example, it's worth a few
    minutes' thought to see if there are other context-sensitive
    naming practices that more than a few people would want to follow.
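
As a rough illustration of items 0 through 4, here is a minimal sketch.
The names AliasRegistry, add_table, primary, all_aliases and reverse are
hypothetical (not an existing or proposed stdlib API), and the table
contents are toy data rather than real database entries.

```python
import unicodedata


class AliasRegistry:
    """Hypothetical registry of alias tables, consulted in priority order."""

    def __init__(self):
        # List of (priority, table_name, mapping); lower number wins.
        self._tables = []

    def add_table(self, name, mapping, priority):
        """Register a table mapping characters to lists of aliases."""
        self._tables.append((priority, name, dict(mapping)))
        self._tables.sort(key=lambda entry: entry[0])

    def all_aliases(self, char):
        """Return every alias for char, highest-priority table first."""
        found = []
        for _priority, _name, mapping in self._tables:
            found.extend(mapping.get(char, []))
        return found

    def primary(self, char):
        """Return the top-priority alias, or None if there is none."""
        aliases = self.all_aliases(char)
        return aliases[0] if aliases else None

    def reverse(self, alias):
        """Return the character an alias names, falling back to the UCD."""
        for _priority, _name, mapping in self._tables:
            for char, aliases in mapping.items():
                if alias in aliases:
                    return char
        return unicodedata.lookup(alias)  # raises KeyError if unknown


registry = AliasRegistry()
# Toy data standing in for NameAliases.txt and a user-supplied table.
registry.add_table("ucd", {"\x1a": ["SUBSTITUTE", "SUB"]}, priority=10)
registry.add_table("local", {"\x1a": ["END OF FILE"]}, priority=0)

print(registry.primary("\x1a"))
print(registry.all_aliases("\x1a"))
print(registry.reverse("END OF FILE") == "\x1a")
```

Changing a table's priority, or registering another table, changes what
primary() returns without touching the lookup functions themselves.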

 > Indeed. [Multiple non-UCD aliases is] also the case for
 > emoji. That's why I suggested making alias() return a mutable
 > record rather than an immutable tuple, so application writers can
 > add their own records to suit their own needs.

Why should they add them to the tuple returned by the function, rather
than to the database the function consults?

 > fully-fledged PEP -- but I think the critical point here is that we 
 > shouldn't be privileging one alias type over the others.

I don't understand.  By providing stdlib support for NameAliases.txt
only, you are privileging those aliases.  If you mean privileging the
Name property over the aliases, well, that's what "canonical" means,
and yes, I think the Name property should be privileged (eg ZERO WIDTH
NO-BREAK SPACE over BYTE ORDER MARK).
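
For reference, the existing unicodedata functions already behave this
way for U+FEFF: name() reports only the Name property, while lookup()
accepts the canonical name and, since Python 3.3, name aliases as well.
A short check, using only stdlib calls:

```python
import unicodedata

bom = "\ufeff"

# name() reports only the canonical Name property.
print(unicodedata.name(bom))

# lookup() accepts the canonical name and also the alias.
print(unicodedata.lookup("ZERO WIDTH NO-BREAK SPACE") == bom)
print(unicodedata.lookup("BYTE ORDER MARK") == bom)
```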

 > That seems fairly extreme. New Unicode versions don't come out that
 > frequently. Surely we don't expect to track draft aliases, or
 > characters outside of Unicode?

Why not track draft aliases in a "draft alias" table?  More important,
why not track aliases of *Unicode* characters that could use aliases
(eg, translations), in separate tables?  For example, there are "shape
based names" for Han characters, which are standard enough so that
users would be able to construct them (Unicode 11 includes one such
system, see section 18.2).  And Japanese names for Han radicals often
vary from the UCD Name property, and are often more precise (many
describe the geometric relation of the radical to the rest of the
character).

It is not obvious to me that an alias() that only looks at
NameAliases.txt is so useful as to belong in the stdlib.  Conversely,
a module that bundles rapidly accumulating databases along the lines
I've mentioned above definitely doesn't belong in the stdlib (a la
pytz).

The *access functions*, however, might belong in the stdlib---in the
same way that timezone-sensitive datetime APIs do---but that sort of
requires knowing what databases and "schema" are out there, and
trying to set things up so that the same APIs can access a number of
databases.
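
To make that split concrete, here is a minimal sketch in which the
access function knows only an interface and the data lives elsewhere,
much as datetime knows tzinfo but not the timezone database.  The names
AliasSource, DictSource and alias are hypothetical, and the one-entry
table is toy data.

```python
import abc
import unicodedata


class AliasSource(abc.ABC):
    """Hypothetical interface a third-party alias database would implement."""

    @abc.abstractmethod
    def aliases(self, char):
        """Return a list of alias strings for char (possibly empty)."""


class DictSource(AliasSource):
    """Toy source backed by a dict; a real one might read a data file."""

    def __init__(self, mapping):
        self._mapping = mapping

    def aliases(self, char):
        return list(self._mapping.get(char, []))


def alias(char, source):
    """Stdlib-style access function: knows the interface, not the data."""
    found = source.aliases(char)
    if found:
        return found[0]
    # Fall back to the canonical name; None if the character has no name.
    return unicodedata.name(char, None)


emoji = DictSource({"\u2764": ["red heart"]})
print(alias("\u2764", emoji))
print(alias("a", emoji))
```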

 > To clarify, do you mean the aliases defined in NameAliases.txt? Or a 
 > subset of them?

I had misunderstood your alias function, which I think is
overengineered for the purpose of handling aliases.  I was thinking in
terms of returning a string or, at most, a list of strings.  If you
are going to define a class to represent metadata about a character,
why not make *all* metadata available?  Probably most of the
attributes would be properties, lazily accessing various databases:

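import unicodedata
from functools import cached_property

# Hypothetical placeholder for an alias database; a real one would
# merge several tables, not just NameAliases.txt.
ALIASES = {0xFEFF: ["BYTE ORDER MARK", "BOM", "ZWNBSP"]}

class Codepoint(object):
    def __init__(self, codepoint):
        self.codepoint = codepoint

    @cached_property
    def name(self):
        # Access name database and cache result (None if unnamed).
        return unicodedata.name(chr(self.codepoint), None)

    @cached_property
    def category(self):
        # Access category database and cache result.
        return unicodedata.category(chr(self.codepoint))

    @property
    def alias(self):
        # Populates alias_list, and returns the first one (or None).
        return self.alias_list[0] if self.alias_list else None

    @cached_property
    def alias_list(self):
        # Access alias database (not limited to NameAliases.txt) and
        # cache result.
        return list(ALIASES.get(self.codepoint, []))

    @property
    def label(self):
        # Populates and returns name, if available, otherwise a code
        # point label.
        return self.name or "U+%04X" % self.codepoint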

and so on.  But that's a new thread.

 > > And even there I think a canonical name based on block name +
 > > code point in hex is the best way to go.
 > 
 > I believe you might be thinking of the Unicode "code point label" 
 > concept.

Yes, as MRAB has suggested.  I would be a little more precise than he,
in that I would label the C0 and C1 control blocks with CONTROL-<code>
rather than just U+<code>.
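
A minimal sketch of that labeling rule, assuming a hypothetical label()
function (this is not the Unicode Standard's own code point label
syntax, just the CONTROL-<code> / U+<code> scheme described here).  It
treats every character of general category Cc as a control, which
covers C0, C1 and also DEL.

```python
import unicodedata


def label(char):
    """Return the Name if there is one, else a code point label."""
    name = unicodedata.name(char, None)
    if name is not None:
        return name
    code = ord(char)
    if unicodedata.category(char) == "Cc":
        # C0 (U+0000..U+001F), DEL, and C1 (U+0080..U+009F) controls.
        return "CONTROL-%04X" % code
    return "U+%04X" % code


print(label("a"))
print(label("\x11"))
print(label("\ue000"))
```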

Steve