[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

Steven D'Aprano steve at pearwood.info
Thu Jul 12 03:17:26 EDT 2018


Replying to a few points out of order...

On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:

> lookup(name(x)) == x for all x is natural isn't it ?

The Unicode Consortium doesn't think so, or else they would mandate that 
all defined code points have a name.
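
To see where the round trip breaks today, here is a minimal sketch using 
only the existing unicodedata.name() and unicodedata.lookup() functions. 
The helper name round_trips is made up for this illustration.

```python
import unicodedata

def round_trips(char):
    """Return True if lookup(name(char)) gives back char."""
    try:
        return unicodedata.lookup(unicodedata.name(char)) == char
    except ValueError:
        # name() raises ValueError when the code point has no name
        return False

print(round_trips('A'))       # an ordinary named character
print(round_trips('\x00'))    # a control character with no formal name

# name() already accepts a default to return instead of raising
print(unicodedata.name('\x00', '<unnamed>'))
```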


> In the NameAliases 
> https://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
> one can see that some characters have multiple aliases, so there are 
> multiple ways to map a character to a name.

That's a pretty old version -- we're up to version 11 now.

https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt


> I propose adding a keyword argument, to 
> unicodedata.name<http://unicodedata.name> 

I don't think that's a real URL.


> that would implement one of some useful behavior when the value does 
> not exist.

I am cautious about overloading functions with keyword-only arguments to 
implement special behaviour. Guido has a strong preference for the "no 
constant flags" rule of thumb, and I agree with that (although I think 
we can extend it beyond just True/False to any N-state value).

The rule of thumb says that if you have a function that takes an 
optional flag which chooses between two (or more) distinct behaviours, 
AND the function is usually called with that flag given as a constant, 
then we should usually prefer to split the function into two separately 
named functions.

For example, in the statistics module, I have stdev() and pstdev(), 
rather than stdev(population=False) and stdev(population=True).
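
As a quick illustration of the two separately named functions, using the 
real statistics module (the data values are arbitrary):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample standard deviation: divides by n - 1
sample_sd = statistics.stdev(data)

# Population standard deviation: divides by n
population_sd = statistics.pstdev(data)

print(sample_sd, population_sd)
```

The caller picks the behaviour by choosing a function name, not by 
passing a constant flag.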

(It's a rule of thumb, not a hard law of nature. There are exceptions.)

It sounds to me that your proposal would fit those conditions and so we 
should prefer a separate function, or a separate API, for doing more 
complex name look-ups.

*Especially* if there's a chance that we'll want to extend this some day 
to use more flags...

name(char, 
     abbreviation=False,
     correction=True,
     control=True,
     figment=True,
     alternate=False,
     )

which are all alias types defined by NameAliases.txt.


> One simple behavior would be to chose the name in the "abbreviation" 
> list. Currently all characters except three only have one and only one 
> abbreviation so that would be a good pick, so I'd imagine name('\x00', 
> abbreviation=True) == 'NUL'

To my mind, that calls out for a separate API to return character alias 
properties as a separate data type:

alias('\u0001')
=> UnicodeAlias(control='START OF HEADING', abbreviation='SOH')

alias('\u000B')
=> UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'),
                abbreviation='VT')

# alternatively, fields could be a single semi-colon delimited 
# string rather than a tuple in the event of multiple aliases

alias('\u01A2')
=> UnicodeAlias(correction='LATIN CAPITAL LETTER GHA')

alias('\u0099')
=> UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER', 
                abbreviation='SGC')


Fields not shown return the empty string.
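
A rough sketch of how such an API might be built, not a proposal for the 
actual implementation. It assumes the "code point;alias;type" record 
layout of NameAliases.txt and embeds a handful of records matching the 
examples above instead of reading the real file. UnicodeAlias, 
load_aliases and alias are hypothetical names, and for simplicity every 
field here is a tuple (empty when there is no alias of that type) rather 
than a string.

```python
from dataclasses import dataclass, replace

# A few records in the NameAliases.txt layout, embedded for illustration.
SAMPLE = """\
0001;START OF HEADING;control
0001;SOH;abbreviation
000B;LINE TABULATION;control
000B;VERTICAL TABULATION;control
000B;VT;abbreviation
0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment
0099;SGC;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
"""

@dataclass
class UnicodeAlias:
    # One field per alias type; deliberately mutable (not frozen).
    correction: tuple = ()
    control: tuple = ()
    alternate: tuple = ()
    figment: tuple = ()
    abbreviation: tuple = ()

def load_aliases(text):
    """Build a mapping of character -> UnicodeAlias from alias records."""
    table = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        code, name, kind = line.split(';')
        record = table.setdefault(chr(int(code, 16)), UnicodeAlias())
        setattr(record, kind, getattr(record, kind) + (name,))
    return table

ALIASES = load_aliases(SAMPLE)

def alias(char):
    """Return a fresh UnicodeAlias for char (all fields empty if none)."""
    return replace(ALIASES.get(char, UnicodeAlias()))

print(alias('\u000B'))
```

Returning a copy means callers can modify their record without affecting 
the shared table.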

This avoids overloading the name() function, future-proofs against new 
alias types, and if UnicodeAlias is a mutable object, easily permits the 
caller to customise the records to suit their own application's needs:

import unicodedata

def myalias(char):
    # unicodedata.alias() is the hypothetical function proposed above
    alias = unicodedata.alias(char)
    if char == '\U0001f346':
        alias.other = ('eggplant', 'purple vegetable')
        alias.slang = ('phallic', ... )
    return alias


-- 
Steve

