[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

Steven D'Aprano steve at pearwood.info
Thu Jul 12 13:14:45 EDT 2018


On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:

>  > I like your alias(...) function, with that one, an application
>  > could code my function like try name(x) except
>  > alias(x).abbreviations[0]. If the abbreviation list is sorted by
>  > AdditionToUnicodeDate.
> 
> I don't understand why that's particularly useful, especially in the
> Han case (see below).

Sorry, I'm not sure if you mean my proposed alias() function isn't 
useful, or Robert's try...except fallback around it.

My alias() function is just a programmatic interface to information 
already available in the NameAliases.txt file. Don't you think that's 
useful enough as it stands?

What people do with it will depend on their application, of course.
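
As a rough sketch of what such an interface might look like: the records 
in NameAliases.txt are semicolon-separated (code point; alias; type), so 
they can be grouped by character and alias type. The parse_aliases() and 
name_or_abbreviation() helpers, the alias() signature, and the dict return 
type are all hypothetical (nothing like this exists in unicodedata), and 
SAMPLE is only a handful of records in the file's format, not the whole 
file.

```python
import unicodedata
from collections import defaultdict

# A few records in the NameAliases.txt format: code point; alias; type.
SAMPLE = """\
# comment lines and blank lines are skipped
0000;NULL;control
0000;NUL;abbreviation
000A;LINE FEED;control
000A;LF;abbreviation
000A;NL;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""

def parse_aliases(text):
    """Map each character to {alias type: [aliases in file order]}."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        codepoint, name, kind = line.split(';')
        table[chr(int(codepoint, 16))][kind].append(name)
    return table

ALIASES = parse_aliases(SAMPLE)

def alias(c):
    """Hypothetical alias(): return a copy of c's aliases grouped by type."""
    return {kind: list(names) for kind, names in ALIASES.get(c, {}).items()}

def name_or_abbreviation(c):
    """Fall back to the first listed abbreviation when c has no name."""
    try:
        return unicodedata.name(c)
    except ValueError:
        return alias(c)['abbreviation'][0]

print(name_or_abbreviation('A'))
print(name_or_abbreviation('\n'))
```

Note that the fallback raises KeyError for an unnamed character with no 
abbreviation; what to do in that case is exactly the sort of decision 
that depends on the application.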


[...]
> On the other hand, if you want useful aliases for Han characters, for
> many of them there could be scores of aliases, based on pronunciation,
> semantics, and appearance, the first two of which vary
> substantially within a single language, let alone across languages.

Indeed. That's also the case for emoji. That's why I suggested making 
alias() return a mutable record rather than an immutable tuple, so 
application writers can add their own records to suit their own needs.

I'll admit I haven't thought really deeply about what the most useful 
API would be -- this was only an initial post on Python-Ideas, not a 
fully-fledged PEP -- but I think the critical point here is that we 
shouldn't be privileging one alias type over the others. The 
NameAliases.txt file makes all that information available, but we can't 
access it (easily, or at all) from unicodedata.

[...]
> So for this to be most useful to me, I would want it developed OUTSIDE
> of the stdlib, with releases even more frequent than pytz (that is an
> exaggeration).

That seems fairly extreme. New Unicode versions don't come out that 
frequently. Surely we don't expect to track draft aliases, or characters 
outside of Unicode?

Application writers might choose to do so -- if somebody wants to 
support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my 
guest, but the stdlib doesn't have to directly support it until it hits 
the NameAliases.txt file :-)
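
To make the mutable-record idea concrete, here is a minimal sketch, 
assuming a record with one list per NameAliases.txt alias type. The 
AliasRecord class, its field names, and the _STANDARD table are all 
hypothetical stand-ins for whatever the stdlib would actually provide.

```python
from dataclasses import dataclass, field

@dataclass
class AliasRecord:
    """Hypothetical mutable record: one list per NameAliases.txt alias type."""
    correction: list = field(default_factory=list)
    control: list = field(default_factory=list)
    alternate: list = field(default_factory=list)
    figment: list = field(default_factory=list)
    abbreviation: list = field(default_factory=list)

# Stand-in for data the stdlib would load from NameAliases.txt.
_STANDARD = {
    '\ufeff': AliasRecord(alternate=['BYTE ORDER MARK'],
                          abbreviation=['BOM', 'ZWNBSP']),
}

def alias(c):
    """Return a fresh record, so callers can mutate their own copy."""
    base = _STANDARD.get(c, AliasRecord())
    return AliasRecord(**{k: list(v) for k, v in vars(base).items()})

# LATIN CAPITAL LETTER A has no entry in the stand-in table, so the
# record starts empty and the application fills in what it wants.
record = alias('A')
record.alternate.append('Ay')   # application-specific, not from Unicode
print(record.alternate)
print(alias('A').alternate)     # the shared table is left untouched
```

Returning a copy each time is one possible design choice: it lets an 
application extend its own record without changing what other callers 
see.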

[...]
> For the stdlib, I'm -1 on anything other than the canonical names plus
> the primary aliases for characters which are well-defined in the code
> charts of the Unicode Standard, such as those for the C0 and (most of)
> the C1 control characters.

To clarify, do you mean the aliases defined in NameAliases.txt? Or a 
subset of them?


> And even there I think a canonical name
> based on block name + code point in hex is the best way to go.

I believe you might be thinking of the Unicode "code point label" 
concept. I have this implementation in my toolbox:


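import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3 has no unichr; chr covers all code points

# The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points
# (U+nFFFE and U+nFFFF) of each of the 17 planes.
NONCHARACTERS = tuple(
    [unichr(n) for n in range(0xFDD0, 0xFDF0)] +
    [unichr(n*0x10000 + 0xFFFE +i) for n in range(17) for i in range(2)]
    )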
assert len(NONCHARACTERS) == 66

def label(c):
    """Return the Code Point Label or character name of c.

    If c is a code point with a name, the name is used as the label;
    otherwise the Code Point Label is returned.

    >>> label(unichr(0x0394))  # u'Δ'
    'GREEK CAPITAL LETTER DELTA'
    >>> label(unichr(0x001F))
    '<control-001F>'

    """
    name = unicodedata.name(c, '')
    if name == '':
        # See section on Code Point Labels
        # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
        number = ord(c)
        category = unicodedata.category(c)
        assert category in ('Cc', 'Cn', 'Co', 'Cs')
        if category == 'Cc':
            kind = 'control'
        elif category == 'Cn':
            if c in NONCHARACTERS:
                kind = 'noncharacter'
            else:
                kind = 'reserved'
        elif category == 'Co':
            kind = 'private-use'
        else:
            assert category == 'Cs'
            kind = 'surrogate'
        name = "<%s-%04X>" % (kind, number)
    return name
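
The separate noncharacter table is needed because the general category 
alone cannot tell noncharacters from reserved code points: both report 
'Cn'. A quick spot check in Python 3 (using chr), with one sample code 
point per label kind; the particular code points are my own illustrative 
choices, and U+0378 is assumed to be still unassigned in the Unicode 
version your Python ships with.

```python
import unicodedata

# One sample code point per label kind, with the category we expect.
SAMPLES = [
    ('control',      0x001F, 'Cc'),
    ('noncharacter', 0xFDD0, 'Cn'),
    ('reserved',     0x0378, 'Cn'),  # assumed unassigned
    ('private-use',  0xE000, 'Co'),
    ('surrogate',    0xD800, 'Cs'),
]

results = []
for kind, number, expected in SAMPLES:
    c = chr(number)
    category = unicodedata.category(c)
    name = unicodedata.name(c, '')   # '' means the code point has no name
    results.append((kind, category, name))
    print("%-12s U+%04X category=%s name=%r" % (kind, number, category, name))
```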



-- 
Steve

