[Tutor] regex: matching unicode

Mon Dec 24 08:51:58 CET 2012

>>Is the code below the only/shortest way to match unicode characters? I would like to match whatever is defined as a character in the unicode reference database. So letters in the broadest sense of the word, but not digits, underscore or whitespace. Until just now, I was convinced that the re.UNICODE flag generalized the [a-z] class to all unicode letters, and that the absence of re.U was an implicit 're.ASCII'. Apparently that mental model was *wrong*.

>>But [^\W\s\d_]+ is kind of hard to read/write.
>>
>>import re
>>s = unichr(956)  # Greek small letter mu (U+03BC)
>>m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)
>>
>>
>A thought would be to rely on the general category of the character, as listed in the Unicode database. unicodedata.category will give you what you need. Here is a list of categories in the Unicode standard:
>
>
>http://www.fileformat.info/info/unicode/category/index.htm
>
>
>
>So, if you wanted only letters, you could say:
>
>
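>import unicodedata
>
>def is_unicode_character(c):
>    # Letter categories are Lu, Ll, Lt, Lm and Lo; no other category contains 'L'
>    assert len(c) == 1
>    return 'L' in unicodedata.category(c)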
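
For comparison, here is a minimal sketch that runs the category test and the negated-class regex side by side on a few characters. It assumes Python 3 (where chr() replaces unichr() and patterns are Unicode-aware by default), unlike the Python 2 code in this thread. The helper names and the sample list are made up for illustration.

```python
import re
import unicodedata

# Python 3 sketch: str is already Unicode and chr() replaces unichr().
# \s is left out of the class because \W already excludes whitespace.
LETTER_RE = re.compile(r"[^\W\d_]")


def is_letter_by_category(c):
    """True if c's general category is one of the letter categories (L*)."""
    return unicodedata.category(c).startswith("L")


def is_letter_by_regex(c):
    """True if the single character c matches the negated-class pattern."""
    return LETTER_RE.fullmatch(c) is not None


# mu, ASCII letter, digit, underscore, space, degree sign, superscript two
samples = [chr(956), "a", "5", "_", " ", chr(176), chr(0xB2)]

for c in samples:
    print("U+%04X  %s  category=%-5s regex=%s" % (
        ord(c), unicodedata.category(c),
        is_letter_by_category(c), is_letter_by_regex(c)))
```

On CPython 3 the two tests should agree on every sample except superscript two (U+00B2, category No): \w accepts it as a word character while \d only covers decimal digits, so the regex lets it through. If "letter" has to mean exactly the L* categories, the category test looks like the safer of the two.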

Hi everybody,

Thanks for your replies, they have been most insightful. For now the 'unicodedata' approach works best for me. I need to validate a word and this is now a two-step approach. First, check if the first character is a (unicode) letter; second, do other checks with a regular regex (i.e., no spaces, ampersands and whatnot). Using one regex would be more elegant, but I got kinda intimidated by the hail of additional flags in the regex module.
Having unicode versions of the classes \d, \w, etc (let's call them \ud, \uw) would be cool.

Here is another useful way to use your (Hugo's) function. The Korean hangul sign and the degree sign almost look the same!

import unicodedata

hangul = unichr(4363)
degree = unichr(176)

def isUnicodeChar(c):
  # Python 2: decode byte strings first, so a multi-byte UTF-8 character has length 1
  c = c.decode("utf-8") if isinstance(c, str) else c
  assert len(c) == 1
  return 'L' in unicodedata.category(c)

>>> isUnicodeChar(hangul)
True
>>> isUnicodeChar(degree)
False
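
The two-step validation described above could be sketched roughly as follows. This is a Python 3 sketch, not the code actually in use. The function names are hypothetical, and the second-step rule (reject whitespace and "&") is only an assumption standing in for the "no spaces, ampersands and whatnot" checks, which the post does not spell out.

```python
import re
import unicodedata

# Assumed rule for step two: whitespace and "&" are forbidden.
# The real list of forbidden characters is not given in the post.
FORBIDDEN_RE = re.compile(r"[\s&]")


def is_unicode_letter(c):
    """True if c is a single character in one of the letter categories (L*)."""
    return len(c) == 1 and unicodedata.category(c).startswith("L")


def is_valid_word(word):
    """Step 1: first character must be a letter. Step 2: no forbidden characters."""
    if not word or not is_unicode_letter(word[0]):
        return False
    return FORBIDDEN_RE.search(word) is None


print(is_valid_word("\u03bcm"))   # starts with Greek mu
print(is_valid_word("\u00b0C"))   # starts with the degree sign
print(is_valid_word("R&D"))       # contains an ampersand
```

The first call should print True and the other two False. As for doing it all in one pattern: the third-party regex module supports Unicode property classes such as \p{L}, which would express the letter test directly inside the regex.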