[Python-ideas] Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names
steve at pearwood.info
Thu Jul 12 03:17:26 EDT 2018
Replying to a few points out of order...
On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:
> lookup(name(x)) == x for all x is natural isn't it ?
The Unicode Consortium doesn't think so, or else they would mandate that
all defined code points have a name.
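As a small illustration of why the round trip cannot hold for every code point: unicodedata.name() raises ValueError for a character with no name, unless a default is supplied. The helper round_trips() below is a hypothetical name used only for this sketch.

```python
import unicodedata


def round_trips(char):
    """Return True if lookup(name(char)) == char, False if char has no name."""
    try:
        return unicodedata.lookup(unicodedata.name(char)) == char
    except ValueError:
        # unicodedata.name() raises ValueError when no name is defined.
        return False


named = round_trips("A")       # 'A' is LATIN CAPITAL LETTER A
unnamed = round_trips("\x00")  # control characters have no formal name

# The optional second argument is returned instead of raising.
fallback = unicodedata.name("\x00", "<unnamed>")
```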
> In the NameAliases
> one can see that some characters have multiple aliases, so there are
> multiple ways to map a character to a name.
That's a pretty old version -- we're up to version 11 now.
> I propose adding a keyword argument, to
I don't think that's a real URL.
> that would implement one of some useful behavior when the value does
> not exist.
I am cautious about overloading functions with keyword-only arguments to
implement special behaviour. Guido has a strong preference for the "no
constant flags" rule of thumb (though I think we can extend it beyond
just True/False to any N-state value), and I agree with that.
The rule of thumb says that if you have a function that takes an
optional flag which chooses between two (or more) distinct behaviours,
AND the function is usually called with that flag given as a constant,
then we should usually prefer to split the function into two separately
named functions.
For example, in the statistics module, I have stdev() and pstdev(),
rather than stdev(population=False) and stdev(population=True).
(It's a rule of thumb, not a hard law of nature. There are exceptions.)
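A minimal sketch of the contrast, using the real statistics.stdev() and statistics.pstdev(). The wrapper stdev_with_flag() is hypothetical and exists only to show the flag-based design the rule of thumb argues against.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def stdev_with_flag(values, population=False):
    """Hypothetical flag-based API, shown only for contrast."""
    if population:
        return statistics.pstdev(values)
    return statistics.stdev(values)


# Callers of the flag version almost always pass a constant...
flagged = stdev_with_flag(data, population=True)

# ...so two separately named functions say the same thing more directly.
sample_sd = statistics.stdev(data)
population_sd = statistics.pstdev(data)
```

Since the flag is known when the code is written, it carries no run-time information; the function name can carry it instead.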
It sounds to me that your proposal would fit those conditions and so we
should prefer a separate function, or a separate API, for doing more
complex name look-ups.
*Especially* if there's a chance that we'll want to extend this some day
to use more flags, one for each of the alias types defined by
NameAliases.txt (correction, control, alternate, figment, abbreviation).
> One simple behavior would be to chose the name in the "abbreviation"
> list. Currently all characters except three only have one and only one
> abbreviation so that would be a good pick, so I'd imagine name('\x00',
> abbreviation=True) == 'NUL'
To my mind, that calls out for a separate API to return character alias
properties as a separate data type:
=> UnicodeAlias(control='START OF HEADING', abbreviation='SOH')
=> UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'),
# Alternatively, a field could be a single semicolon-delimited string
# rather than a tuple in the event of multiple aliases.
=> UnicodeAlias(correction='LATIN CAPITAL LETTER GHA')
=> UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER',
Fields not shown return the empty string.
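One way such an API might look, as a rough sketch only. UnicodeAlias and alias() are the hypothetical names from this proposal, not anything in the unicodedata module, and the table is a small hand-copied excerpt of NameAliases.txt rather than the full file.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hand-copied excerpt of NameAliases.txt: code point;alias;type
_EXCERPT = """\
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
000B;LINE TABULATION;control
000B;VERTICAL TABULATION;control
000B;VT;abbreviation
0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment
0099;SGC;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
"""

AliasField = Union[str, Tuple[str, ...]]


@dataclass
class UnicodeAlias:
    """Mutable record with one field per alias type; empty string if absent."""
    correction: AliasField = ""
    control: AliasField = ""
    alternate: AliasField = ""
    figment: AliasField = ""
    abbreviation: AliasField = ""


def _parse(text):
    """Build {char: {alias_type: [names, ...]}} from NameAliases-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, name, kind = line.split(";")
        by_kind = table.setdefault(chr(int(code, 16)), {})
        by_kind.setdefault(kind, []).append(name)
    return table


_TABLE = _parse(_EXCERPT)


def alias(char):
    """Return the aliases of char; multiple aliases of one type form a tuple."""
    fields = {}
    for kind, names in _TABLE.get(char, {}).items():
        fields[kind] = names[0] if len(names) == 1 else tuple(names)
    return UnicodeAlias(**fields)
```

A new alias type would then mean a new field on the record, leaving the signature of name() alone.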
This avoids overloading the name() function, future-proofs against new
alias types, and if UnicodeAlias is a mutable object, easily permits the
caller to customise the records to suit their own application's needs:
    alias = unicodedata.alias(char)
    if char == '\U0001f346':
        alias.other = ('eggplant', 'purple vegetable')
        alias.slang = ('phallic', ... )