Using re to find unicode ranges
girzel at gmail.com
Tue Sep 30 05:46:17 CEST 2008
On Sep 29, 11:03 pm, "Mark Tolonen" <M8R-yft... at mailinator.com> wrote:
> "Eric Abrahamsen" <e... at ericabrahamsen.net> wrote in message
> news:mailman.1674.1222694261.3487.python-list at python.org...
> > Is it possible to use the re module to find runs of characters within a
> > certain Unicode range?
> > I'm writing a Markdown extension to go over text and wrap blocks of
> > consecutive Chinese characters in <span class="char"></span> tags for
> > nice styling in an HTML page. The available hooks appear to be a
> > preprocessor (which is a "for line in lines" situation) or an inline pattern
> > (which uses regular expressions). The regular expression solution would
> > be much simpler and faster, but something tells me there's no way to use
> > a regex to find character ranges... Chinese characters appear to fall
> > between 19968 and 40959 using ord(), and I suppose I can go that route if
> > necessary, but I think it would be ugly.
> # coding: utf-8
> import re
> sample = u'My name is 马克. I am 美国人.'
> for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
>     print n
Of course! And obvious, once you point it out. Thanks for the help.
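
For the wrapping step itself, the same character class should work with re.sub. The following is only a rough sketch, assuming Python 3 (where plain strings are Unicode and the re module understands \u escapes in patterns). The name wrap_cjk is made up for illustration, and hooking the pattern into Markdown's inline-pattern mechanism is not shown.

```python
import re

# Range taken from the thread: U+4E00..U+9FFF, i.e. 19968..40959 via ord().
# Assumption: this single block is enough; other CJK ranges are not covered.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def wrap_cjk(text):
    """Wrap each run of consecutive Chinese characters in a span tag.

    Hypothetical helper, not part of any Markdown extension API.
    """
    # \g<0> refers to the whole match, i.e. the run of characters found.
    return CJK_RUN.sub(r'<span class="char">\g<0></span>', text)

print(wrap_cjk('My name is 马克. I am 美国人.'))
```

Text without any characters in that range passes through unchanged, since sub only touches the matched runs.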
> This sounds similar to what zhpy
> (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
> Chinese words from code, to generate executable English Python. You might
> give that a look.
Mark - not quite what I'm after here, but pretty interesting.