tjreedy at udel.edu
Thu Dec 15 14:45:15 EST 2016
On 12/15/2016 11:53 AM, Steve D'Aprano wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
> For example:
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
> Is this information available from Python?
Yes, though not as nicely as you probably want. (Have you searched for
existing 3rd party modules?) As near as I can tell, there is no direct
'script' property in the unicodedata module.
Option 1: unicodedata module, from char name
>>> import unicodedata as ucd
>>> ucd.name('ξ')
'GREEK SMALL LETTER XI'
>>> ucd.name('a')
'LATIN SMALL LETTER A'
In most cases, the non-common char names start with a script name.
In some cases, the script name is 2 or 3 words.
>>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'
In a few cases, the script name is embedded in the name.
>>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'
Occasionally the script name is omitted.
>>> ucd.name('\u3300')
'SQUARE APAATO' # Katakana
Too bad the Unicode Consortium did not use a consistent naming scheme:
script [, subscript]: character
LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
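A rough sketch of Option 1, assuming a hand-maintained list of script names. KNOWN_SCRIPTS and guess_script are hypothetical names, the list is deliberately incomplete, and the result is written in the upper case used by character names. It handles the "starts with" and "embedded" cases above and gives up on the rest.

```python
import unicodedata as ucd

# Hypothetical, incomplete list of script names as they appear in
# character names. This is an assumption, not an official list.
KNOWN_SCRIPTS = ("OLD SOUTH ARABIAN", "LATIN", "GREEK", "HIRAGANA", "KATAKANA")

def guess_script(char):
    """Guess a script from the character name; return None if no guess."""
    name = ucd.name(char, "")  # "" for characters without a name
    # Case 1: the name starts with a (possibly multi-word) script name.
    for script in KNOWN_SCRIPTS:
        if name.startswith(script + " "):
            return script
    # Case 2: a single-word script name is embedded in the name.
    words = name.split()
    for script in KNOWN_SCRIPTS:
        if script in words:
            return script
    # Case 3: the script name is omitted, or the script is not in the list.
    return None
```

With this list, guess_script('a') gives 'LATIN' and guess_script('\U0001F200') gives 'HIRAGANA', but '\u3300' (SQUARE APAATO) and '3' both give None, which is why the name is not a reliable source for the script.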
> More about Unicode scripts:
Option 2: Fetch Scripts.txt, the Unicode Character Database file that
assigns a script to each codepoint or codepoint range.
Suboption 1: Turn Scripts.txt into a list of lines. The lines could be
condensed to codepoint or codepoint range, script. Write a function
that takes a character or codepoint and linearly scans the list for a
matching line. This makes each lookup O(number-of-lines).
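A minimal sketch of Suboption 1. SAMPLE is a short made-up excerpt in the "codepoint[..codepoint] ; Script # comment" line format of Scripts.txt, not the real file, and parse_ranges and script_of are hypothetical names. Returning 'Unknown' for anything not listed is an assumption of this sketch.

```python
# Made-up excerpt in the Scripts.txt line format, for illustration only.
SAMPLE = """\
# sample data, not the real file
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03BE          ; Greek # L&       GREEK SMALL LETTER XI
"""

def parse_ranges(text):
    """Condense the lines to a list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # blank or comment-only line
        codes, script = (part.strip() for part in line.split(";"))
        first, _, last = codes.partition("..")
        # A single codepoint is treated as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_of(char, ranges, default="Unknown"):
    """Linear scan: O(number of lines) per lookup."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default
```

For the real file, the text would come from reading a downloaded copy instead of SAMPLE; the parsing should carry over as long as the line format is as assumed here.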
Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint
being the index. This takes more preparation, but makes each lookup
O(1). Once the preparation is done, the list could be turned into a
tuple and saved as a .py file, with the tuple being a compiled constant
in a .pyc file.
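A sketch of Suboption 2, assuming the ranges have already been extracted from Scripts.txt as (first, last, script) tuples. RANGES here is a tiny made-up stand-in, and build_table, script_of and module_source are hypothetical names.

```python
# Made-up stand-in for the ranges parsed from Scripts.txt.
RANGES = [(0x30, 0x39, "Common"), (0x61, 0x7A, "Latin"), (0x3BE, 0x3BE, "Greek")]

def build_table(ranges, default="Unknown"):
    """Return a tuple of script names indexed by codepoint."""
    size = max(last for _, last, _ in ranges) + 1
    table = [default] * size
    for first, last, script in ranges:
        # Same-length slice assignment; every slot in the range
        # refers to the same string object.
        table[first:last + 1] = [script] * (last - first + 1)
    return tuple(table)

SCRIPTS = build_table(RANGES)

def script_of(char, table=SCRIPTS, default="Unknown"):
    """O(1) lookup by indexing with the codepoint."""
    cp = ord(char)
    return table[cp] if cp < len(table) else default

# Source text for a generated module holding the table as a literal.
module_source = "SCRIPTS = {!r}\n".format(SCRIPTS)
```

Writing module_source to a .py file and importing it is one way to do the preparation once and reuse the result afterwards.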
To avoid bloat, make sure that multiple entries for a script use the
same string object instead of multiple equal strings. (CPython string
interning might do this automatically, but cross-implementation code
should not depend on this.) The difference is
scripts = [..., 'Han', 'Han', 'Han', ...] # multiple strings
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...] # multiple references to one string
On a 64-bit OS, the latter would use 8 bytes x the number of defined
codepoints (about 200,000). Assuming such a module does not already
exist, it might be worth making one available on PyPI.
Essentially, ditto, except that I would use a dict rather than a
sequence as there are only about 400 codepoints involved.
Terry Jan Reedy