Unicode regex and Hindi language

Fri Nov 28 14:29:34 EST 2008

Jerry Hill wrote:
> On Fri, Nov 28, 2008 at 10:47 AM, Shiao <multiseed at gmail.com> wrote:
>> The regex below identifies words in all languages I tested, but not in
>> Hindi:
>>
>> # -*- coding: utf-8 -*-
>>
>> import re
>> pat = re.compile('^(\w+)$', re.U)
>> langs = ('English', '中文', 'हिन्दी')
> 
> I think the problem is that the Hindi text contains both alphanumeric
> and non-alphanumeric characters.  I'm not very familiar with Hindi,
> much less how it's represented in Unicode, but take a look at the
> output of this code:
> 
> # -*- coding: utf-8 -*-
> import unicodedata as ucd
> 
> langs = (u'English', u'中文', u'हिन्दी')
> for lang in langs:
>     print lang
>     for char in lang:
>         print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))
> 
> Output:
> 
> English
> 	 E LATIN CAPITAL LETTER E (Lu)
> 	 n LATIN SMALL LETTER N (Ll)
> 	 g LATIN SMALL LETTER G (Ll)
> 	 l LATIN SMALL LETTER L (Ll)
> 	 i LATIN SMALL LETTER I (Ll)
> 	 s LATIN SMALL LETTER S (Ll)
> 	 h LATIN SMALL LETTER H (Ll)
> 中文
> 	 中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
> 	 文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
> हिन्दी
> 	 ह DEVANAGARI LETTER HA (Lo)
> 	 ि DEVANAGARI VOWEL SIGN I (Mc)
> 	 न DEVANAGARI LETTER NA (Lo)
> 	 ् DEVANAGARI SIGN VIRAMA (Mn)
> 	 द DEVANAGARI LETTER DA (Lo)
> 	 ी DEVANAGARI VOWEL SIGN II (Mc)
> 
> From that, we see that there are some characters in the Hindi string
> that aren't letters (they're not in Unicode category L), but are
> instead marks (Unicode category M).

Python 3.0 allows Unicode identifiers.  Mn and Mc characters are included
in the set of characters allowed after the first character of an
identifier.  'Hindi' is a word both in its native characters and in
Latin transliteration.

http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords 
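
As a rough illustration of that point, here is a minimal sketch, assuming
a Python 3 interpreter and a UTF-8 source file.  It only uses
unicodedata.category() and str.isidentifier() from the standard library;
the variable names are just for the example.

```python
import unicodedata

word = 'हिन्दी'

# General category of each code point: the consonants are Lo,
# the dependent vowel signs are Mc, and the virama is Mn.
categories = [unicodedata.category(ch) for ch in word]
print(categories)

# str.isidentifier() applies the identifier rules from the
# language reference, which accept Mn and Mc after the first character.
print(word.isidentifier())

# So the Hindi word can be used directly as a variable name.
हिन्दी = 'Hindi'
print(हिन्दी)
```

The string passes the identifier check even though half of its code
points are marks rather than letters.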

re is too restrictive in its definition of 'word'. I suggest that OP 
(original poster) Shiao file a bug report at http://bugs.python.org
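
Until that is addressed, one possible workaround is to skip \w and test
the Unicode categories directly.  This is only a sketch, assuming
Python 3; is_word() is a hypothetical helper, not anything in re, and its
idea of a 'word' (letters, marks, numbers, and connector punctuation such
as the underscore) is my assumption about what is wanted here.

```python
import unicodedata

def is_word(text):
    """Return True if text is non-empty and every character is a
    letter (L*), mark (M*), number (N*), or connector punctuation (Pc)."""
    if not text:
        return False
    for ch in text:
        cat = unicodedata.category(ch)
        # cat[0] is the major class; Pc covers '_' and similar connectors.
        if not (cat[0] in 'LMN' or cat == 'Pc'):
            return False
    return True

for lang in ('English', '中文', 'हिन्दी', 'two words'):
    print(lang, is_word(lang))
```

Note that this is looser than the identifier rules: it accepts a string
that starts with a digit or a combining mark, so tighten the first
character check if that matters.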

tjr