Unicode regex and Hindi language
Terry Reedy
tjreedy at udel.edu
Fri Nov 28 14:29:34 EST 2008
Jerry Hill wrote:
> On Fri, Nov 28, 2008 at 10:47 AM, Shiao <multiseed at gmail.com> wrote:
>> The regex below identifies words in all languages I tested, but not in
>> Hindi:
>>
>> # -*- coding: utf-8 -*-
>>
>> import re
>> pat = re.compile('^(\w+)$', re.U)
>> langs = ('English', '中文', 'हिन्दी')
>
> I think the problem is that the Hindi text contains both alphanumeric
> and non-alphanumeric characters. I'm not very familiar with Hindi,
> much less how it's represented in Unicode, but take a look at the
> output of this code:
>
> # -*- coding: utf-8 -*-
> import unicodedata as ucd
>
> langs = (u'English', u'中文', u'हिन्दी')
> for lang in langs:
>     print lang
>     for char in lang:
>         print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))
>
> Output:
>
> English
> E LATIN CAPITAL LETTER E (Lu)
> n LATIN SMALL LETTER N (Ll)
> g LATIN SMALL LETTER G (Ll)
> l LATIN SMALL LETTER L (Ll)
> i LATIN SMALL LETTER I (Ll)
> s LATIN SMALL LETTER S (Ll)
> h LATIN SMALL LETTER H (Ll)
> 中文
> 中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
> 文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
> हिन्दी
> ह DEVANAGARI LETTER HA (Lo)
> ि DEVANAGARI VOWEL SIGN I (Mc)
> न DEVANAGARI LETTER NA (Lo)
> ् DEVANAGARI SIGN VIRAMA (Mn)
> द DEVANAGARI LETTER DA (Lo)
> ी DEVANAGARI VOWEL SIGN II (Mc)
>
> From that, we see that there are some characters in the Hindi string
> that aren't letters (they're not in Unicode category L), but are
> instead marks (Unicode category M).
Python 3.0 allows Unicode identifiers. Mn and Mc characters are included
in the set of characters allowed after the first character of an
identifier. 'Hindi' is a word in both its native characters and in
Latin transliteration.
http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
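
As a quick check of the identifier rule, here is a minimal sketch,
assuming Python 3 (the code quoted above is Python 2). str.isidentifier()
is a standard string method; the variable names are only for
illustration.

```python
import unicodedata

word = "हिन्दी"

# List the general category of each code point: letters are Lo,
# the vowel signs are Mc, and the virama is Mn.
categories = [unicodedata.category(char) for char in word]
print(categories)

# The identifier rules accept Mn and Mc after the first character,
# so the whole string qualifies as an identifier.
print(word.isidentifier())
```
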
By comparison, re is too restrictive in its definition of 'word': \w
does not match the Mn and Mc marks. I suggest that the OP (original
poster), Shiao, file a bug report at http://bugs.python.org
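
Until re changes, one possible workaround is to test the Unicode
categories directly. The sketch below assumes Python 3; is_word is a
hypothetical helper, not part of re or unicodedata, and its notion of
'word' (letters, marks, numbers, and connector punctuation such as the
underscore) is only an approximation of what \w might reasonably accept.

```python
import unicodedata


def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter (L*), mark (M*), number (N*), or connector punctuation (Pc)."""
    if not text:
        return False
    for char in text:
        category = unicodedata.category(char)
        # category[0] is the major class: L, M, N, P, S, Z, or C.
        if category[0] not in "LMN" and category != "Pc":
            return False
    return True


for lang in ("English", "中文", "हिन्दी", "two words"):
    print(lang, is_word(lang))
```

This accepts a string that consists only of marks, which a stricter
definition might reject; requiring the first character to be a letter or
number would tighten it.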
tjr