[Tutor] wiki madness grows on you like a brain fungus [regex
refinements, re.sub() can take in a subst. function]
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Sun Aug 10 03:53:26 EDT 2003
> Here's a compact, faster version of iswikiword():
>
> import re
> wikiword = re.compile('[A-Z][^A-Z]+[A-Z]').search
> def iswikiword(word, wikiword = re.compile('[A-Z][^A-Z]+[A-Z]').search):
>     return wikiword(word) is not None
Hi Raymond and Kirk,
The regular expression is too lenient: the second word needs to be at
least two characters long. For example,
TeX
isn't considered a WikiWord, but your regular expression
[A-Z][^A-Z]+[A-Z]
will match it.
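A quick check at the interpreter confirms it:
###
>>> import re
>>> re.search('[A-Z][^A-Z]+[A-Z]', 'TeX').group()
'TeX'
###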
There's one other problem: we need to introduce "anchors" that guarantee
that we won't try to hunt for WikiWords within our word. That is, we
definitely want to avoid matching:
thisIsAnEmbeddedsentence
But this is easy to fix: for our real regular expression, we can use '^'
and '$' to mark the beginning and ending of our pattern. Alternatively,
we can use the '\b' word-boundary metacharacter.
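To make the difference concrete, here is what the pattern above does with
and without the word breaks:
###
>>> import re
>>> re.search('[A-Z][^A-Z]+[A-Z]', 'thisIsAnEmbeddedsentence').group()
'IsA'
>>> re.search(r'\b[A-Z][^A-Z]+[A-Z]\b', 'thisIsAnEmbeddedsentence')
>>>
###
(The second search quietly returns None: there's no word break in the
middle of the word for the pattern to latch onto.)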
I wrote a rambling post a while back:
http://mail.python.org/pipermail/tutor/2003-August/024579.html
where, at the end, I had calculated an alternative regular expression:
[A-Z][a-z]+
[A-Z][a-z]
([A-Z][a-z] | [a-z])*
If we look at it in a twisted way, it even makes a kind of sense. But now
that I think about it more, there's a conceptually much simpler one:
([A-Z][a-z]+){2,}
which concisely says "two or more capitalized words". This just goes to
show that too much education threatens to obscure simple solutions.
*grin*
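Just as a sanity check (purely illustrative; the names 'longwinded' and
'concise' are made up for this sketch), the two patterns agree on the
sample words we'll try below:
###
import re

## the earlier, long-winded pattern squeezed onto one line, and the
## concise one, both wrapped in word breaks
longwinded = re.compile(r'\b[A-Z][a-z]+[A-Z][a-z]([A-Z][a-z]|[a-z])*\b')
concise = re.compile(r'\b([A-Z][a-z]+){2,}\b')

for sample in ['TeX', 'WikiWord', 'WikiWordS', 'thisIsEmbedded']:
    print sample, bool(longwinded.match(sample)), bool(concise.match(sample))
###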
With this in mind, here is a modified iswikiword() that should be more
strict:
###
import re

def iswikiword(word):
    regex = re.compile(r'''
        \b                    ## beginning break,
        ([A-Z][a-z]+){2,}     ## at least two capitalized
                              ## words,
        \b                    ## and an ending break.
        ''', re.VERBOSE)
    return regex.match(word)
###
Let's see how it works:
###
>>> iswikiword('TeX')
>>> iswikiword('WikiWord')
<_sre.SRE_Match object at 0x2028e0>
>>> iswikiword('WikiWordS')
>>> iswikiword('thisIsEmbedded?')
>>>
###
Kirk, your main processing loop:
###
for rawline in page:
    line = string.split(rawline, ' ')
    ...
    for word in line:
        if iswikiword(word):
            ...
###
is doing too much work. It's possible to do all the substitutions at once
by using re.sub(), the substitution function. Here's a concrete example:
###
>>> wiki_regex = re.compile(r'\b([A-Z][a-z]+){2,}\b')
>>> def processWikiMatch(match):
...     '''Given a match object of a WikiWord, returns a link to that
...     WikiWord.'''
...     return makewikilink(match.group(0))
...
>>> def makewikilink(word):
...     '''proxy function; replace this with real function'''
...     return '<a href="">%s</a>' % word
...
>>> result = wiki_regex.sub(processWikiMatch,
... '''
... This is a test of feeding a WikiWorded sentence
... into a RegularExpression, using a
... SubstitutionFunction. Cool, huh?''')
>>> print result
This is a test of feeding a <a href="">WikiWorded</a> sentence
into a <a href="">RegularExpression</a>, using a
<a href="">SubstitutionFunction</a>. Cool, huh?
###
re.sub() here can optionally take a "substitution" function, rather than
just a simple string! This is something that isn't as well known as it
deserves to be.
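So Kirk, as a rough sketch (the name linkify() is made up here, and I'm
assuming 'page' is a sequence of raw lines, as in your loop above), the
whole word-splitting loop can collapse into one sub() per line:
###
import re

wiki_regex = re.compile(r'\b([A-Z][a-z]+){2,}\b')

def makewikilink(word):
    '''proxy function; replace this with the real link maker'''
    return '<a href="">%s</a>' % word

def linkify(page):
    '''Takes the raw lines of a page and returns new lines with every
    WikiWord turned into a link, with no word-by-word splitting needed.'''
    def replace(match):
        return makewikilink(match.group(0))
    return [wiki_regex.sub(replace, rawline) for rawline in page]
###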
Hope this helps!