Unicode regex and Hindi language
Peter Otten
__peter__ at web.de
Fri Nov 28 11:29:21 EST 2008
Shiao wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:
>
> # -*- coding: utf-8 -*-
>
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')
>
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
>
> Output:
>
> English English
> 中文 中文
> हिन्दी None
>
> From this I assumed that the Hindi text contains punctuation or other
> characters that prevent the word match. Now, even more puzzling is
> this:
>
> pat = re.compile('^(\W+)$', re.U) # note: now \W
>
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
>
> Output:
>
> English None
> 中文 None
> हिन्दी None
>
> How can the Hindi be both not a word and "not not a word"??
>
> Any clue would be much appreciated!
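It's not a word, but that doesn't mean that it consists entirely of
non-alpha characters either. With re.U, \w matches only characters that
the Unicode database classifies as alphanumeric (plus the underscore).
The Devanagari vowel signs and the virama are combining marks, not
letters, so the string mixes both kinds and neither ^(\w+)$ nor ^(\W+)$
can match all of it. Here's what Python gets to see: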
>>> langs[2]
u'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> from unicodedata import name
>>> for c in langs[2]:
...     print repr(c), name(c), ["non-alpha", "ALPHA"][c.isalpha()]
...
u'\u0939' DEVANAGARI LETTER HA ALPHA
u'\u093f' DEVANAGARI VOWEL SIGN I non-alpha
u'\u0928' DEVANAGARI LETTER NA ALPHA
u'\u094d' DEVANAGARI SIGN VIRAMA non-alpha
u'\u0926' DEVANAGARI LETTER DA ALPHA
u'\u0940' DEVANAGARI VOWEL SIGN II non-alpha
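
If you want the combining marks to count as part of a word, one possible
workaround is to look at the Unicode general category of each character
yourself instead of relying on \w. This is only a minimal sketch,
assuming that "word" means letters, numbers, combining marks and the
underscore; is_word is a made-up helper name, not part of the re module:

```python
import unicodedata

def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter, a number, a combining mark, or an underscore."""
    if not text:
        return False
    for ch in text:
        # The first letter of the general category is the major class:
        # 'L' = letter, 'N' = number, 'M' = mark (Mn, Mc, Me).
        if ch != u'_' and unicodedata.category(ch)[0] not in u'LNM':
            return False
    return True

hindi = u'\u0939\u093f\u0928\u094d\u0926\u0940'

# General category of each character in the Hindi string; the vowel
# signs and the virama fall into the mark categories.
categories = [unicodedata.category(c) for c in hindi]
```

With this, is_word(hindi) is true because the vowel signs and the virama
are accepted as marks, while a string containing a space or punctuation
is still rejected. Whether marks should count as word characters in your
application is a judgment call.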
Peter