[issue16684] Unicode property value abbreviated names and long names
Greg Price
report at bugs.python.org
Fri Sep 20 03:56:31 EDT 2019
Greg Price <gnprice at gmail.com> added the comment:
I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py and into the unicodedata module. Patch attached. Demo:
>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
'WS': ['White_Space']},
'category': {'C': ['Other'],
# ...
'east_asian_width': {'A': ['Ambiguous'],
# ...
'W': ['Wide']}}
Note that the values are lists. That's because a value can have multiple aliases in addition to its "short name":
>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']
This implementation also provides the reverse mapping, from an alias to the "short name":
>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...
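For readers without the patch applied, the relationship between the two mappings can be sketched in pure Python. This is only an illustration: SAMPLE_ALIASES is a hand-written excerpt in the shape shown in the demo, invert_aliases is a hypothetical helper, and the patch itself builds both dicts in C from generated tables.

```python
# Hand-written excerpt in the shape of the forward mapping from the demo.
# This is sample data for illustration, not the generated table.
SAMPLE_ALIASES = {
    "category": {"Nd": ["Decimal_Number", "digit"], "C": ["Other"]},
    "east_asian_width": {"A": ["Ambiguous"], "W": ["Wide"]},
}


def invert_aliases(aliases):
    """Build {property: {alias: short_name}} from {property: {short_name: [alias, ...]}}.

    Hypothetical helper; the patch does the equivalent work in C at import time.
    """
    by_alias = {}
    for prop, values in aliases.items():
        table = by_alias.setdefault(prop, {})
        for short_name, names in values.items():
            for name in names:
                # Every alias of a value points back to that value's short name.
                table[name] = short_name
    return by_alias


SAMPLE_BY_ALIAS = invert_aliases(SAMPLE_ALIASES)
```

With this sample data, both 'Decimal_Number' and 'digit' map back to 'Nd' under the 'category' key.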
This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:
* This version is in C; at import time some C code builds up the dicts from static tables in the header generated by makeunicodedata.py. It's not *that* much code... but it sure would be more convenient to do in Python instead.
Should the unicodedata module perhaps have a Python part? I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea. Then this could go there instead of using the C code I've just written.
* Is this API the right one?
* This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .
* Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .
* Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions?
So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .
* There's also room for bikeshedding on the names.
* How shall we handle ucd_3_2_0 for this feature?
This implementation doesn't attempt to record the older version of the data. My reasoning is that because the applications of the old data are quite specific and they haven't needed this information yet, it seems unlikely anyone will ever really want to know from this module just which aliases existed already in 3.2.0 and which didn't yet.
OTOH, as a convenience I've caused e.g. unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the same object as unicodedata.property_value_by_alias . This allows unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata module itself, while minimizing the complexity it adds to the implementation.
Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.
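To make the "functions instead of dicts" option from the API question concrete, here is a rough sketch of loose matching in the spirit of UAX #44 rule LM3 (ignore case, whitespace, underscores, hyphens, and an initial "is" prefix). Everything in it is an assumption for illustration: the sample data is hand-written, loose_key and build_loose_index are hypothetical helpers, and the lookup function here takes the property name as an explicit first argument, which is one possible signature rather than the one in the patch.

```python
import re

# Hand-written sample in the shape of the reverse mapping from the demo.
# This is sample data for illustration, not the generated table.
SAMPLE_BY_ALIAS = {
    "category": {"Decimal_Number": "Nd", "digit": "Nd", "Other": "C"},
}


def loose_key(name):
    """Normalize a name roughly as UAX44-LM3 describes (hypothetical helper)."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_\-]", "", name).lower()
    # Drop an initial "is" prefix. The stored names are normalized the same
    # way, so lookups stay consistent; a real implementation would need to
    # check the full data for collisions this could introduce.
    if key.startswith("is"):
        key = key[2:]
    return key


def build_loose_index(by_alias):
    """Index every alias and every short name under its normalized key."""
    index = {}
    for prop, aliases in by_alias.items():
        table = index.setdefault(prop, {})
        for alias, short_name in aliases.items():
            table[loose_key(alias)] = short_name
            table[loose_key(short_name)] = short_name
    return index


LOOSE_INDEX = build_loose_index(SAMPLE_BY_ALIAS)


def property_value_by_alias(prop, name):
    """Look up a short name by loosely matched alias; raises KeyError if unknown."""
    return LOOSE_INDEX[prop][loose_key(name)]
```

Under these assumptions, property_value_by_alias('category', 'decimal number') and property_value_by_alias('category', 'is-decimal-number') both resolve to the same value as 'Decimal_Number'. A dict-based API could not offer this without pre-populating every spelling variant, which is the main argument for functions.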
----------
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue16684>
_______________________________________