Unicode script

Thu Dec 15 14:45:15 EST 2016

On 12/15/2016 11:53 AM, Steve D'Aprano wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
>
> For example:
>
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>
> Is this information available from Python?

Yes, though not as nicely as you probably want.  (Have you searched for
existing 3rd party modules?)  As near as I can tell, the unicodedata
module does not expose a 'script' property directly.

Option 1: unicodedata module, from char name

 >>> import unicodedata as ucd
 >>> ucd.name('\u03be')
'GREEK SMALL LETTER XI'
 >>> ucd.name('\u0061')
'LATIN SMALL LETTER A'

In most cases, the names of non-common chars start with a script name.
In some cases, the script name is 2 or 3 words.

 >>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'

In a few cases, the script name is embedded in the name.
 >>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'

Occasionally the script name is omitted.
 >>> ucd.name('\u3300')
'SQUARE APAATO'  # Katakana

Too bad the Unicode Consortium did not use a consistent naming scheme:
script [, subscript]: character

LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
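
Given the names as they are, a name-based guess might look like the
following rough sketch.  KNOWN_SCRIPTS and guess_script_from_name are
made-up names for illustration, and the tuple is deliberately incomplete;
a real version would need the full list of script names.

```python
import unicodedata as ucd

# Hypothetical, incomplete list of script names (an assumption for this
# sketch, not something the unicodedata module provides).
KNOWN_SCRIPTS = ('OLD SOUTH ARABIAN', 'HIRAGANA', 'KATAKANA', 'GREEK', 'LATIN')

def guess_script_from_name(char):
    """Guess a script from the character name; return None if no guess."""
    name = ucd.name(char, '')
    # Usual case: the name starts with the script name.
    for script in KNOWN_SCRIPTS:
        if name.startswith(script + ' '):
            return script
    # Less common: a one-word script name embedded in the name.
    words = name.split()
    for script in KNOWN_SCRIPTS:
        if script in words:
            return script
    # The script is omitted from the name (or not in our list).
    return None
```

This handles the first three kinds of name shown above, but returns None
for '\u3300', where the script is not in the name at all.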

> More about Unicode scripts:
>
> http://www.unicode.org/reports/tr24/
> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

Option 2: Fetch the above Scripts.txt.

Suboption 1: Turn Scripts.txt into a list of lines.  The lines could be 
condensed to codepoint or codepoint range, script.  Write a function 
that takes a character or codepoint and linearly scans the list for a 
matching line.  This makes each lookup O(number-of-lines).
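
A minimal sketch of the linear scan, assuming the 'codepoint or range ;
script # comment' line layout of Scripts.txt.  SAMPLE is a small
illustrative excerpt standing in for the downloaded file, and the function
names are my own.  'Unknown' is used as the fallback for codepoints that
no line covers.

```python
# Illustrative excerpt in the Scripts.txt layout, not the full file.
SAMPLE = """\
# comment lines and blank lines are skipped
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
03A3..03E1    ; Greek # L&  [63] GREEK CAPITAL LETTER SIGMA..GREEK SMALL LETTER SAMPI
"""

def parse_ranges(text):
    """Condense each data line to a (first, last, script) tuple."""
    ranges = []
    for line in text.splitlines():
        data = line.split('#', 1)[0].strip()    # drop the trailing comment
        if not data:
            continue
        codepoints, script = (field.strip() for field in data.split(';'))
        first, _, last = codepoints.partition('..')
        # A single codepoint is treated as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_linear(char, ranges, default='Unknown'):
    """Scan the ranges in order; O(number of lines) per lookup."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default
```

For example, script_linear('a', parse_ranges(SAMPLE)) gives 'Latin' with
the sample data above.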

Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint
being the index.  This takes more preparation, but makes each lookup 
O(1).  Once the preparation is done, the list could be turned into a 
tuple and saved as a .py file, with the tuple being a compiled constant 
in a .pyc file.
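
A sketch of the indexed table, assuming the file has already been reduced
to (first, last, script) ranges.  RANGES here is a tiny hand-written
stand-in for the parsed file, and build_table and script_of are names
invented for this example.

```python
def build_table(ranges, size=0x110000, default='Unknown'):
    """Return a tuple with one script name per codepoint below size."""
    table = [default] * size
    for first, last, script in ranges:
        # Every slot in the range refers to the same string object.
        table[first:last + 1] = [script] * (last - first + 1)
    return tuple(table)

# Hand-written stand-in for the ranges parsed from Scripts.txt.
RANGES = [
    (0x0030, 0x0039, 'Common'),
    (0x0061, 0x007A, 'Latin'),
    (0x03A3, 0x03E1, 'Greek'),
]

SCRIPTS = build_table(RANGES)

def script_of(char):
    """O(1) lookup: the codepoint is the index."""
    return SCRIPTS[ord(char)]
```

With the default size the table has one slot for every codepoint up to
U+10FFFF (0x110000 entries); a smaller size could be passed if only the
lower planes matter.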

To avoid bloat, make sure that multiple entries for a script use the 
same string object instead of multiple equal strings.  (CPython string 
interning might do this automatically, but cross-implementation code 
should not depend on this.)  The difference is

scripts = [..., 'Han', 'Han', 'Han', ...] # multiple strings
versus
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...]  # multiple references to one string

On a 64-bit system, the latter would use 8 bytes per entry, or 8 x the
number of defined codepoints (about 200,000).  Assuming such a module does
not already exist, it might be worth making one available on PyPI.
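
One way to get the shared strings without relying on interning, as a
sketch: pass the parsed names through a dict that keeps the first object
seen for each distinct value.  share_strings is a name I made up for this
example.

```python
def share_strings(names):
    """Return a list where equal names all refer to one string object."""
    canonical = {}
    # setdefault returns the first object stored for an equal key.
    return [canonical.setdefault(name, name) for name in names]

# Strings built at run time, as they would be when parsing a file.
raw = [''.join(['H', 'an']) for _ in range(3)]
scripts = share_strings(raw)
```

After this, every element of scripts is the same object as scripts[0],
whatever the implementation does about interning.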

> http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt

Essentially, ditto, except that I would use a dict rather than a 
sequence as there are only about 400 codepoints involved.
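
A sketch of the dict version, assuming ScriptExtensions.txt uses the same
'codepoints ; value # comment' layout with a space-separated list of short
script codes as the value.  The SAMPLE lines are illustrative rather than
copied from the file, and parse_extensions is my own name.

```python
# Illustrative lines in the ScriptExtensions.txt layout, not verbatim data.
SAMPLE = """\
0640          ; Arab Mand Syrc # Lm       ARABIC TATWEEL
3001..3003    ; Bopo Hang Hani Hira Kana Yiii # Po   [3] IDEOGRAPHIC COMMA..DITTO MARK
"""

def parse_extensions(text):
    """Map each listed codepoint to a tuple of script codes."""
    extensions = {}
    for line in text.splitlines():
        data = line.split('#', 1)[0].strip()    # drop the trailing comment
        if not data:
            continue
        codepoints, scripts = (field.strip() for field in data.split(';'))
        first, _, last = codepoints.partition('..')
        value = tuple(scripts.split())
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            extensions[cp] = value              # one shared tuple per line
    return extensions

EXTENSIONS = parse_extensions(SAMPLE)
```

A lookup is then EXTENSIONS.get(ord(char)), which returns None for the
great majority of characters that have no entry in the file.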

-- 
Terry Jan Reedy