[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names
Steven D'Aprano
steve at pearwood.info
Thu Jul 12 13:14:45 EDT 2018
On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:
> > I like your alias(...) function, with that one, an application
> > could code my function like try name(x) except
> > alias(x).abbreviations[0]. If the abbreviation list is sorted by
> > AdditionToUnicodeDate.
>
> I don't understand why that's particularly useful, especially in the
> Han case (see below).
Sorry, I'm not sure if you mean my proposed alias() function isn't
useful, or Robert's try...except loop around it.
My alias() function is just a programmatic interface to information
already available in the NameAliases.txt file. Don't you think that's
useful enough as it stands?
What people do with it will depend on their application, of course.
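As a rough illustration of such an interface, here is a minimal sketch, assuming the data uses the semicolon-separated "code point;alias;type" layout of NameAliases.txt. The names SAMPLE, parse_aliases, ALIASES and name_or_abbreviation are hypothetical, the sample is only a short excerpt of the file, and alias() here returns a plain dict rather than any settled API. The last helper shows the try...except fallback discussed above.

```python
import unicodedata
from collections import defaultdict

# Short excerpt in the NameAliases.txt layout: code point;alias;type
SAMPLE = """\
# comment lines and blank lines are skipped
0000;NULL;control
0000;NUL;abbreviation
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""

def parse_aliases(text):
    """Map each code point to {alias type: [aliases in file order]}."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if not line:
            continue
        code, alias_name, kind = line.split(';')
        table[int(code, 16)][kind].append(alias_name)
    return table

ALIASES = parse_aliases(SAMPLE)  # hypothetical module-level table

def alias(c):
    """Return the aliases of character c as a dict keyed by alias type."""
    return {kind: list(names) for kind, names in ALIASES.get(ord(c), {}).items()}

def name_or_abbreviation(c):
    """Fall back to the first listed abbreviation when c has no name."""
    try:
        return unicodedata.name(c)
    except ValueError:
        # Raises KeyError if c has no abbreviation alias either.
        return alias(c)['abbreviation'][0]

print(alias('\n'))
print(name_or_abbreviation('\n'))
```

The lists keep the order in which entries appear in the data. Whether that order matches the date an alias was added to Unicode is not something this sketch establishes.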
[...]
> On the other hand, if you want useful aliases for Han characters, for
> many of them there could be scores of aliases, based on pronunciation,
> semantics, and appearance, the first two of which vary
> substantially within a single language, let alone across languages.
Indeed. That's also the case for emoji. That's why I suggested making
alias() return a mutable record rather than an immutable tuple, so
application writers can add their own records to suit their own needs.
I'll admit I haven't thought really deeply about what the most useful
API would be -- this was only an initial post on Python-Ideas, not a
fully-fledged PEP -- but I think the critical point here is that we
shouldn't be privileging one alias type over the others. The
NameAliases.txt file makes all that information available, but we can't
access it (easily, or at all) from unicodedata.
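To make the "mutable record" idea concrete, here is a minimal sketch, assuming one list per alias type used in NameAliases.txt (correction, control, alternate, figment, abbreviation). AliasRecord is a hypothetical name, and the ':grinning:' shortcode is an invented application-level alias, not Unicode data.

```python
from dataclasses import dataclass, field

@dataclass
class AliasRecord:
    """Hypothetical mutable record: one list per NameAliases.txt alias type."""
    correction: list = field(default_factory=list)
    control: list = field(default_factory=list)
    alternate: list = field(default_factory=list)
    figment: list = field(default_factory=list)
    abbreviation: list = field(default_factory=list)

# U+FEFF as listed in NameAliases.txt: no alias type is treated as primary.
bom = AliasRecord(alternate=['BYTE ORDER MARK'],
                  abbreviation=['BOM', 'ZWNBSP'])

# U+1F600 GRINNING FACE has no entries of its own, so the record starts
# empty and the application attaches a category that suits its needs.
grin = AliasRecord()
grin.shortcode = [':grinning:']  # invented, application-specific alias

print(bom)
print(grin.shortcode)
```

Because the record is an ordinary mutable object, an application can extend an existing list or add a new attribute without the stdlib having to know about either.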
[...]
> So for this to be most useful to me, I would want it developed OUTSIDE
> of the stdlib, with releases even more frequent than pytz (that is an
> exaggeration).
That seems fairly extreme. New Unicode versions don't come out that
frequently. Surely we don't expect to track draft aliases, or characters
outside of Unicode?
Application writers might choose to do so -- if somebody wants to
support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my
guest, but the stdlib doesn't have to directly support it until it hits
the NameAliases.txt file :-)
[...]
> For the stdlib, I'm -1 on anything other than the canonical names plus
> the primary aliases for characters which are well-defined in the code
> charts of the Unicode Standard, such as those for the C0 and (most of)
> the C1 control characters.
To clarify, do you mean the aliases defined in NameAliases.txt? Or a
subset of them?
> And even there I think a canonical name
> based on block name + code point in hex is the best way to go.
I believe you might be thinking of the Unicode "code point label"
concept. I have this implementation in my toolbox:
    import unicodedata

    try:
        unichr          # Python 2
    except NameError:
        unichr = chr    # Python 3 has no unichr; chr covers the full range

    # The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points
    # of each of the 17 planes.
    NONCHARACTERS = tuple(
        [unichr(n) for n in range(0xFDD0, 0xFDF0)] +
        [unichr(n*0x10000 + 0xFFFE + i) for n in range(17) for i in range(2)]
        )
    assert len(NONCHARACTERS) == 66

    def label(c):
        """Return the Code Point Label or character name of c.

        If c is a code point with a name, the name is used as the label;
        otherwise the Code Point Label is returned.

        >>> label(unichr(0x0394))  # u'Δ'
        'GREEK CAPITAL LETTER DELTA'
        >>> label(unichr(0x001F))
        '<control-001F>'

        """
        name = unicodedata.name(c, '')
        if name == '':
            # See section on Code Point Labels
            # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
            number = ord(c)
            category = unicodedata.category(c)
            assert category in ('Cc', 'Cn', 'Co', 'Cs')
            if category == 'Cc':
                kind = 'control'
            elif category == 'Cn':
                if c in NONCHARACTERS:
                    kind = 'noncharacter'
                else:
                    kind = 'reserved'
            elif category == 'Co':
                kind = 'private-use'
            else:
                assert category == 'Cs'
                kind = 'surrogate'
            name = "<%s-%04X>" % (kind, number)
        return name
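For readers on Python 3, a compact variant of the same idea might look like the following. This is only a sketch, assuming Python 3's chr/str semantics. The KINDS table and the is_noncharacter helper are hypothetical names introduced here, and the noncharacter test is done arithmetically instead of with a precomputed tuple.

```python
import unicodedata

# General categories that map directly to a code point label kind.
KINDS = {'Cc': 'control', 'Co': 'private-use', 'Cs': 'surrogate'}

def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def label(c):
    """Return the character name of c, or its code point label if unnamed."""
    name = unicodedata.name(c, '')
    if name:
        return name
    cp = ord(c)
    category = unicodedata.category(c)
    if category == 'Cn':
        kind = 'noncharacter' if is_noncharacter(cp) else 'reserved'
    else:
        # KeyError here would mean an unnamed character in an unexpected
        # category, the case the original guards with an assert.
        kind = KINDS[category]
    return '<%s-%04X>' % (kind, cp)

for c in ('\u0394', '\x1f', '\ue000', '\ud800', '\ufdd0', '\U0010ffff'):
    print('U+%04X' % ord(c), label(c))
```

The arithmetic check selects the same 66 code points as the NONCHARACTERS tuple, without building a tuple of strings up front.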
--
Steve