Counting unicode graphemes in python

Srinath Avadhanula srinathava_news at yahoo.com
Fri Oct 24 18:03:30 EDT 2003


On Fri, 24 Oct 2003, vincent wehren wrote:
> |
> | the first two "code points" represent a single character on the screen.
>
> My GUESS is that you can't do that unless you *know* exactly which codepoints
> form ligatures. In DEVANAGARI these are e.g. the so-called dependent vowels
> in range 093e - 094c, wherein 093f stands "left of the consonant" when
> rendered. (My knowledge of Indic languages is limited, at best, so there may
> be more to it..)
>
After a sleepless night, I finally found out that calculating grapheme
boundaries for Devanagari is not so hard after all. It seems to work
reasonably well if I use just three simple rules:

The rules decide whether, in the code point sequence 'ab', the junction
between 'a' and 'b' is a glyph boundary:

1. If 'b' is some kind of mark (i.e. unicodedata.category(b) starts
   with 'M'), then the 'ab' junction is not a glyph boundary.

2. If 'b' is not a mark but is a Devanagari letter (i.e. category 'Lo')
   AND 'a' is a VIRAMA character (i.e. 'VIRAMA' in unicodedata.name(a)),
   then the 'ab' junction is not a glyph boundary.

3. In every other situation, the 'ab' junction is a glyph boundary.

I don't really know if this is completely correct, but it performs pretty
well on quite a big Sanskrit text I have. It handles things like

NA + HALANT + DHA + HALANT + YA + AA

and reports it (correctly) as a single glyph.
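
A minimal Python 3 sketch of the three rules above, for illustration only.
The function names is_boundary and split_glyphs are hypothetical and do not
come from the original message, and the trailing AA in the example is
assumed to be the dependent vowel sign U+093E rather than the independent
letter. HALANT is the VIRAMA character, U+094D.

```python
import unicodedata


def is_boundary(a, b):
    """Return True if the junction between code points a and b is a glyph boundary."""
    # Rule 1: a mark (category M*) attaches to what precedes it.
    if unicodedata.category(b).startswith('M'):
        return False
    # Rule 2: a letter (category 'Lo') right after a VIRAMA continues the conjunct.
    # unicodedata.name() is given a default so unnamed code points do not raise.
    if unicodedata.category(b) == 'Lo' and 'VIRAMA' in unicodedata.name(a, ''):
        return False
    # Rule 3: every other junction is a boundary.
    return True


def split_glyphs(text):
    """Split text into a list of glyph clusters using is_boundary()."""
    clusters = []
    for ch in text:
        if clusters and not is_boundary(clusters[-1][-1], ch):
            clusters[-1] += ch   # no boundary: extend the current cluster
        else:
            clusters.append(ch)  # boundary (or first character): start a new one
    return clusters


# NA + HALANT + DHA + HALANT + YA + AA (vowel sign, assumed)
word = "\u0928\u094d\u0927\u094d\u092f\u093e"
print(len(split_glyphs(word)))  # 1
```

Since split_glyphs returns the clusters themselves, the length of each
cluster in whatever encoding the GUI expects could be used as the step size
for cursor movement.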

> | In my application, the GUI seems to handle that part (i.e combining
> | characters). However, I need to handle cursor movement myself. The GUI
> | can only be told to move forward by a specified number of bytes.
>
> What GUI are you working with?
>
I am using wxPython on Windows XP. There are two text display widgets,
wxTextCtrl and wxStyledTextCtrl. The former is pretty basic but the
caret positioning is pretty robust. The latter is very fancy, handles
syntax highlighting etc., but has some serious problems with combining
characters.

> Some systems such as the X Server on IndiX seem to dig into the  GPOS and
> GSUB tables in the OpenType font. See:
>
> http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html
>
Thanks for the link!

> Would "Look at
> http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?

It does indeed. Notice my new-found fluency with unicodedata.category?
:)

Thanks,
Srinath




