Unicode script

MRAB python at mrabarnett.plus.com
Sat Dec 17 21:34:49 EST 2016


On 2016-12-16 02:44, MRAB wrote:
> On 2016-12-15 21:57, Terry Reedy wrote:
>> On 12/15/2016 1:06 PM, MRAB wrote:
>>> On 2016-12-15 16:53, Steve D'Aprano wrote:
>>>> Suppose I have a Unicode character, and I want to determine the script or
>>>> scripts it belongs to.
>>>>
>>>> For example:
>>>>
>>>> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
>>>> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
>>>> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>>>>
>>>>
>>>> Is this information available from Python?
>>>>
>>>>
>>>> More about Unicode scripts:
>>>>
>>>> http://www.unicode.org/reports/tr24/
>>>> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
>>>> http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt
>>>>
>>>>
>>> Interestingly, there's issue 6331 "Add unicode script info to the
>>> unicode database". Looks like it didn't make it into Python 3.6.
>>
>> https://bugs.python.org/issue6331
>> Opened in 2009 with patch and 2 revisions for 2.x.  At least the Python
>> code needs to be updated.
>>
>> Approved in principle by Martin, then unicodedata curator, but no longer
>> active.  Neither, very much, are the other 2 listed in the Experts Index.
>>
>> From what I could see, both the Python API (there is no doc patch yet)
>> and internal implementation need more work.  If I were to get involved,
>> I would look at the APIs of PyICU (see Eryk Sun's post) and the
>> unicodescript module on PyPI (mentioned by Pander Musubi on the issue).
>>
> For what it's worth, the post has prompted me to get back to a module I
> started which will report such Unicode properties, essentially the ones
> that the regex module supports. It just needs a few more tweaks and
> packaging up...
>
Finally completed and uploaded!

It's called 'uniprop' and it's at:

https://pypi.python.org/pypi/uniprop/1.0

For Python 3.4-3.6.
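
For anyone who only needs the Script property and would rather stay with
the standard library, here is a minimal sketch that parses text in the
Scripts.txt format linked earlier in the thread. It is not uniprop's API
(which isn't shown in this post). The names SAMPLE, parse_scripts and
script_of are made up for illustration, and the SAMPLE lines are
illustrative entries in the file's format rather than a verbatim copy. In
practice you would read the real Scripts.txt instead. Code points that the
data does not list are reported as "Unknown", which I believe matches the
default given in UAX #24.

```python
import bisect

# Illustrative lines in the Scripts.txt format (not a verbatim copy of the
# file). Each data line is: code point or range ; script name # comment
SAMPLE = """\
# Hypothetical excerpt for demonstration only.
0030..0039    ; Common # DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # FEMININE ORDINAL INDICATOR
03A3..03E1    ; Greek # GREEK CAPITAL LETTER SIGMA..GREEK LETTER SAMPI
"""


def parse_scripts(text):
    """Return a sorted list of (start, end, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(";"))
        start, _, end = codepoints.partition("..")
        # A single code point has no "..", so it is a range of length one.
        ranges.append((int(start, 16), int(end or start, 16), script))
    ranges.sort()
    return ranges


def script_of(char, ranges):
    """Look up the script of a single character; "Unknown" if not listed."""
    cp = ord(char)
    # Find the last range whose start is <= cp. The sentinel end value is
    # larger than any real code point, so it sorts after every such range.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "Unknown"


RANGES = parse_scripts(SAMPLE)

if __name__ == "__main__":
    for ch in "3a\u03be":
        print("U+%04X" % ord(ch), script_of(ch, RANGES))
```

This relies on the ranges not overlapping, which is how Scripts.txt assigns
exactly one script per code point. ScriptExtensions.txt, where a character
can belong to several scripts, would need a different data structure.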


