Using re to find unicode ranges
ptmcg at austin.rr.com
Mon Sep 29 16:45:47 CEST 2008
On Sep 29, 8:17 am, Eric Abrahamsen <e... at ericabrahamsen.net> wrote:
> Is it possible to use the re module to find runs of characters within
> a certain Unicode range?
> I'm writing a Markdown extension to go over text and wrap blocks of
> consecutive Chinese characters in <span class="char"></span> tags for
> nice styling in an HTML page. The available hooks appear to be a
> preprocessor (which is a "for line in lines" situation) or an inline
> pattern (which uses regular expressions). The regular expression
> solution would be much simpler and faster, but something tells me
> there's no way to use a regex to find character ranges... Chinese
> characters appear to fall between 19968 and 40959 using ord(), and I
> suppose I can go that route if necessary, but I think it would be ugly.
> Any hints or suggestions would be appreciated!
This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy)
does to extract Chinese words from code, to generate executable
English Python. You might give that a look.
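
On the regex question itself: re character classes do accept Unicode
ranges, so the ord() values you quote (19968 to 40959, i.e. U+4E00 to
U+9FFF) can go straight into a pattern. Below is a minimal sketch in
Python 3 syntax. It is not taken from zhpy or from Markdown's extension
API; the names CJK_RUN and wrap_chinese are made up for illustration,
and the range covers only the block you mention, not every CJK
character.

```python
import re

# Assumption: "Chinese characters" means the range from the question,
# ord() 19968..40959, which is U+4E00..U+9FFF.
CJK_RUN = re.compile("[\u4e00-\u9fff]+")


def wrap_chinese(text):
    """Wrap each run of consecutive characters in the range in a span tag."""
    return CJK_RUN.sub(
        lambda m: '<span class="char">' + m.group(0) + "</span>",
        text,
    )


if __name__ == "__main__":
    print(wrap_chinese("abc \u4e2d\u6587 def"))
```

On Python 2 the pattern and the input would both need to be unicode
objects (u"..." literals) for the range to match as intended.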