[Tutor] Re: lists in re?
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Tue Sep 9 15:07:29 EDT 2003
On Tue, 9 Sep 2003, Andreas Zwinkau wrote:
> > parser = [ re.compile(item) for item in ["[abc]","abc","[A-Z]"] ]
> I'm not used to such "complex" constructs, but well, I can work with it.
>
> > Any chance of an example of what you're putting into it and what the
> > code is supposed to make out of it? It would be most useful if you'd
> > put in some *real* examples, because a key-value pair like "abc":"def"
> > is a bit too abstract for me to understand its purpose.
> Do you know what a Wiki is? I have some user text with abstract markup,
> e.g. __underlined__
> These should be converted into HTML: <u>underlined</u>
>
> There are some more rules.
> ''italic'' -> <i>italic</i>
> **bold** -> <b>bold</b>
> [word] -> <a href="word">word</a>
> [image.gif] -> <img src="image.gif" />
> [http://google.com|Google] -> <a href="http://google.com">Google</a>
> http://slashdot.org -> <a
> href="http://slashdot.org">http://slashdot.org</a>
>
>
> If this gets more and more, I thought a dictionary would be the best way
> to define it in an obvious way. So this dict needs to be fed to the re
> module, but instead of processing each item, I wanted to re.compile it
> in one piece.
Hi Andreas,
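A dictionary of pattern -> template pairs fed through re.sub is one direct
way to do the simple substitutions (a minimal sketch; the rule set and the
function name are invented for illustration):

```python
import re

# Illustrative rule set: each key is a regex, each value an HTML
# replacement template for re.sub.  Not from the original post.
WIKI_RULES = {
    r"__(\w+)__":     r"<u>\1</u>",
    r"''(\w+)''":     r"<i>\1</i>",
    r"\*\*(\w+)\*\*": r"<b>\1</b>",
}

def wiki_to_html(text):
    # Apply each substitution rule over the whole text in turn.
    for pattern, template in WIKI_RULES.items():
        text = re.sub(pattern, template, text)
    return text
```

This handles the simple inline rules; the URL and [word] rules would need
more careful patterns and an HTML-escaping pass, which is where a
rule-by-rule tokenizing approach helps.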
Rather than hardcode the list of patterns in the program, it might be
useful to define these different constructs in a separate text file
instead. We can then write a function that takes that text file and
generates an analyzer to process the text.
For example, let's say that we had a text file like:
######
### wikirules.txt
BOLD: \*\*(\w+)\*\*
ITALIC: ''(\w+)''
## The last two rules are 'catchalls' and should be at the bottom of this
## list
WHITESPACE: (\s+)
ANYTHING: (\S+)
######
We could then write a program that takes these rules, and creates a
function that knows how to analyze text. Below is a toy example. I wrote
it quickly, so it's definitely not production-quality code, nor is it
commented well... *grin*
###
import re

def makeAnalyzer(pattern_text):
    """Creates a generator that can analyze text, and iteratively
    returns the tokens it can find."""
    tagged_regexs = getTaggedRegexs(pattern_text)
    def analyzer(text):
        while text:
            for t, r in tagged_regexs:
                match = r.match(text)
                if match:
                    yield (t, match)
                    text = text[match.end():]
                    break
            else:
                ## No rule matched; stop instead of looping forever.
                return
    return analyzer

def getTaggedRegexs(pattern_text):
    """Takes the pattern text and pulls out a list of
    tag-regex pairs."""
    tagged_regexs = []
    for line in pattern_text.split('\n'):
        ## Ignore comment lines.
        if re.match(r'^\s*#', line):
            continue
        match = re.match(r'^(\w+):\s*(.*?)\s*$', line)
        if match is None:
            ## Ignore lines that don't fit our tag: pattern format.
            continue
        tag, pattern = match.groups()
        tagged_regexs.append((tag, re.compile(pattern)))
    return tagged_regexs
###
Here's the program in action:
###
>>> rules = """
... ### wikirules.txt
...
... BOLD: \*\*(\w+)\*\*
... ITALIC: ''(\w+)''
...
... ## The last two rules are 'catchalls'
... WHITESPACE: (\s+)
... ANYTHING: (\S+)
... """
>>> lexer = makeAnalyzer(rules)
>>> for tag, match in lexer("**hello**, this is a ''test''!"):
...     print tag, match.group(0)
...
BOLD **hello**
ANYTHING ,
WHITESPACE
ANYTHING this
WHITESPACE
ANYTHING is
WHITESPACE
ANYTHING a
WHITESPACE
ITALIC ''test''
ANYTHING !
###
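The remaining step is turning those (tag, match) tokens into HTML. Here is
a sketch of that final pass; the compact lexer mirrors makeAnalyzer's loop,
and TEMPLATES and render() are invented names for this illustration:

```python
import re

# Hardcoded version of the rule file, checked in order.
RULES = [
    ("BOLD",       re.compile(r"\*\*(\w+)\*\*")),
    ("ITALIC",     re.compile(r"''(\w+)''")),
    ("WHITESPACE", re.compile(r"(\s+)")),
    ("ANYTHING",   re.compile(r"(\S+)")),
]

# Tag -> HTML template; tags not listed here pass through unchanged.
TEMPLATES = {"BOLD": "<b>{}</b>", "ITALIC": "<i>{}</i>"}

def tokenize(text):
    # Same idea as the analyzer: try each rule at the front of the
    # remaining text, emit the first match, and advance past it.
    while text:
        for tag, regex in RULES:
            match = regex.match(text)
            if match:
                yield tag, match
                text = text[match.end():]
                break

def render(text):
    parts = []
    for tag, match in tokenize(text):
        template = TEMPLATES.get(tag, "{}")  # non-markup tokens pass through
        parts.append(template.format(match.group(1)))
    return "".join(parts)
```

With the same input as the transcript, render("**hello**, this is a
''test''!") produces the marked-up string with <b> and <i> in place.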
Since the rules are defined in a separate file, it becomes easy to switch
in and out different sets of rules, just by copying over a new
wiki-definition file. The code might end up being a bit long, though...
*grin*
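As for the original wish to "re.compile it in one piece": the rules can
also be joined into a single alternation of named groups and compiled
once; match.lastgroup then reports which rule fired. A minimal sketch
(the rule names and scan() are invented for illustration, and the same
rule-ordering caveats apply as with the list version):

```python
import re

RULES = [
    ("BOLD",       r"\*\*\w+\*\*"),
    ("ITALIC",     r"''\w+''"),
    ("WHITESPACE", r"\s+"),
    ("ANYTHING",   r"\S+"),
]

# One big pattern: (?P<BOLD>...)|(?P<ITALIC>...)|..., compiled once.
master = re.compile("|".join("(?P<%s>%s)" % (tag, pattern)
                             for tag, pattern in RULES))

def scan(text):
    # lastgroup is the name of the alternative that matched each token.
    for match in master.finditer(text):
        yield match.lastgroup, match.group()
```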
Hope this helps!