[Tutor] Re: lists in re?
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Tue Sep 9 15:07:29 EDT 2003
On Tue, 9 Sep 2003, Andreas Zwinkau wrote:
> > parser = [ re.compile(item) for item in ["[abc]","abc","[A-Z]"] ]
> I'm not used to such "complex" constructs, but well, I can work with it.
>
> > Any chance of an example of what you're putting into it and what the
> > code is supposed to make out of it? It would be most useful if you'd
> > put in some *real* examples, because a key-value pair like "abc":"def"
> > is a bit too abstract for me to understand its purpose.
> Do you know what a Wiki is? I have some user text with abstract markup,
> e.g. __underlined__
> These should be converted into HTML: <u>underlined</u>
>
> There are some more rules.
> ''italic'' -> <i>italic</i>
> **bold** -> <b>bold</b>
> [word] -> <a href="word">word</a>
> [image.gif] -> <img src="image.gif" />
> [http://google.com|Google] -> <a href="http://google.com">Google</a>
> http://slashdot.org -> <a
> href="http://slashdot.org">http://slashdot.org</a>
>
>
> If this gets more and more, I thought a dictionary would be the best way
> to define it in an obvious way. So this dict needs to be fed to the re
> module, but instead of processing each item, I wanted to re.compile it
> in one piece.
Hi Andreas,
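A dictionary of pattern -> template pairs fed through re.sub is one direct
way to do the simple substitutions (a minimal sketch; the rule set and the
function name are invented for illustration):

```python
import re

# Illustrative rule set: each key is a regex, each value an HTML
# replacement template for re.sub.  Not from the original post.
WIKI_RULES = {
    r"__(\w+)__":     r"<u>\1</u>",
    r"''(\w+)''":     r"<i>\1</i>",
    r"\*\*(\w+)\*\*": r"<b>\1</b>",
}

def wiki_to_html(text):
    # Apply each substitution rule over the whole text in turn.
    for pattern, template in WIKI_RULES.items():
        text = re.sub(pattern, template, text)
    return text
```

This handles the simple inline rules; the URL and [word] rules would need
more careful patterns and an HTML-escaping pass, which is where a
rule-by-rule tokenizing approach helps.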
Rather than hardcode the list of patterns in the program, it might be
useful to define these different constructs in a separate text file
instead. We can then write a function that takes that text file and
generates an analyzer to process the text.
For example, let's say that we had a text file like:
######
### wikirules.txt
BOLD: \*\*(\w+)\*\*
ITALIC: ''(\w+)''
## The last two rules are 'catchalls' and should be at the bottom of this
## list
WHITESPACE: (\s+)
ANYTHING: (\S+)
######
We could then write a program that takes these rules, and creates a
function that knows how to analyze text. Below is a toy example. I wrote
it quickly, so it's definitely not production-quality code, nor is it
commented well... *grin*
###
import re

def makeAnalyzer(pattern_text):
    """Creates a generator that can analyze text, and iteratively
    returns the tokens it can find."""
    tagged_regexs = getTaggedRegexs(pattern_text)
    def analyzer(text):
        while text:
            for t, r in tagged_regexs:
                match = r.match(text)
                if match:
                    yield (t, match)
                    text = text[match.end():]
                    break
            else:
                ## No rule matched; stop instead of looping forever.
                return
    return analyzer

def getTaggedRegexs(pattern_text):
    """Takes the pattern text and pulls out a list of
    tag-regex pairs."""
    tagged_regexs = []
    for line in pattern_text.split('\n'):
        ## Ignore comment lines.
        if re.match(r'^\s*#', line):
            continue
        match = re.match(r'^(\w+):\s*(.*?)\s*$', line)
        if match is None:
            ## Ignore lines that don't fit our tag: pattern format.
            continue
        tag, pattern = match.groups()
        tagged_regexs.append((tag, re.compile(pattern)))
    return tagged_regexs
###
Here's the program in action:
###
>>> rules = """
... ### wikirules.txt
...
... BOLD: \*\*(\w+)\*\*
... ITALIC: ''(\w+)''
...
... ## The last two rules are 'catchalls'
... WHITESPACE: (\s+)
... ANYTHING: (\S+)
... """
>>> lexer = makeAnalyzer(rules)
>>> for tag, match in lexer("**hello**, this is a ''test''!"):
...     print tag, match.group(0)
...
BOLD **hello**
ANYTHING ,
WHITESPACE
ANYTHING this
WHITESPACE
ANYTHING is
WHITESPACE
ANYTHING a
WHITESPACE
ITALIC ''test''
ANYTHING !
###
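The remaining step is turning those (tag, match) tokens into HTML. Here is
a sketch of that final pass; the compact lexer mirrors makeAnalyzer's loop,
and TEMPLATES and render() are invented names for this illustration:

```python
import re

# Hardcoded version of the rule file, checked in order.
RULES = [
    ("BOLD",       re.compile(r"\*\*(\w+)\*\*")),
    ("ITALIC",     re.compile(r"''(\w+)''")),
    ("WHITESPACE", re.compile(r"(\s+)")),
    ("ANYTHING",   re.compile(r"(\S+)")),
]

# Tag -> HTML template; tags not listed here pass through unchanged.
TEMPLATES = {"BOLD": "<b>{}</b>", "ITALIC": "<i>{}</i>"}

def tokenize(text):
    # Same idea as the analyzer: try each rule at the front of the
    # remaining text, emit the first match, and advance past it.
    while text:
        for tag, regex in RULES:
            match = regex.match(text)
            if match:
                yield tag, match
                text = text[match.end():]
                break

def render(text):
    parts = []
    for tag, match in tokenize(text):
        template = TEMPLATES.get(tag, "{}")  # non-markup tokens pass through
        parts.append(template.format(match.group(1)))
    return "".join(parts)
```

With the same input as the transcript, render("**hello**, this is a
''test''!") produces the marked-up string with <b> and <i> in place.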
Since the rules are defined in a separate file, it becomes easy to switch
in and out different sets of rules, just by copying over a new
wiki-definition file. The code might end up being a bit long, though...
*grin*
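As for the original wish to "re.compile it in one piece": the rules can
also be joined into a single alternation of named groups and compiled
once; match.lastgroup then reports which rule fired. A minimal sketch
(the rule names and scan() are invented for illustration, and the same
rule-ordering caveats apply as with the list version):

```python
import re

RULES = [
    ("BOLD",       r"\*\*\w+\*\*"),
    ("ITALIC",     r"''\w+''"),
    ("WHITESPACE", r"\s+"),
    ("ANYTHING",   r"\S+"),
]

# One big pattern: (?P<BOLD>...)|(?P<ITALIC>...)|..., compiled once.
master = re.compile("|".join("(?P<%s>%s)" % (tag, pattern)
                             for tag, pattern in RULES))

def scan(text):
    # lastgroup is the name of the alternative that matched each token.
    for match in master.finditer(text):
        yield match.lastgroup, match.group()
```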
Hope this helps!