[Tutor] wiki madness grows on you like a brain fungus [regex
refinements, re.sub() can take in a subst. function]
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Sun Aug 10 03:53:26 EDT 2003
> Here's a compact, faster version of iswikiword():
>
> import re
> wikiword = re.compile('[A-Z][^A-Z]+[A-Z]').search
> def iswikiword(word, wikiword = re.compile('[A-Z][^A-Z]+[A-Z]').search):
>     return wikiword(word) is not None
Hi Raymond and Kirk,
The regular expression is too lenient: the second word needs to be at
least two characters long. For example,
TeX
isn't considered a WikiWord, but your regular expression
[A-Z][^A-Z]+[A-Z]
will match it.
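A quick check at the interpreter confirms it:
###
>>> import re
>>> re.search('[A-Z][^A-Z]+[A-Z]', 'TeX').group()
'TeX'
###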
There's one other problem: we need to introduce "anchors" that guarantee
that we won't try to hunt for WikiWords within our word. That is, we
definitely want to avoid matching:
thisIsAnEmbeddedsentence
But this is easy to fix: for our real regular expression, we can use '^'
and '$' to mark the beginning and ending of our pattern. Alternatively,
we can use the '\b' word-boundary metacharacter.
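To make the difference concrete, here is what the pattern above does with
and without the word breaks:
###
>>> import re
>>> re.search('[A-Z][^A-Z]+[A-Z]', 'thisIsAnEmbeddedsentence').group()
'IsA'
>>> re.search(r'\b[A-Z][^A-Z]+[A-Z]\b', 'thisIsAnEmbeddedsentence')
>>>
###
(The second search quietly returns None: there's no word break in the
middle of the word for the pattern to latch onto.)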
I wrote a rambling post a while back:
http://mail.python.org/pipermail/tutor/2003-August/024579.html
where, at the end, I had calculated an alternative regular expression:
[A-Z][a-z]+
[A-Z][a-z]
([A-Z][a-z] | [a-z])*
If we look at it in a twisted way, it even makes a kind of sense. But now
that I think about it more, there's a conceptually much simpler one:
([A-Z][a-z]+){2,}
which concisely says "two or more capitalized words". This just goes to
show that too much education threatens to obscure simple solutions.
*grin*
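Just as a sanity check (purely illustrative; the names 'longwinded' and
'concise' are made up for this sketch), the two patterns agree on the
sample words we'll try below:
###
import re

## the earlier, long-winded pattern squeezed onto one line, and the
## concise one, both wrapped in word breaks
longwinded = re.compile(r'\b[A-Z][a-z]+[A-Z][a-z]([A-Z][a-z]|[a-z])*\b')
concise = re.compile(r'\b([A-Z][a-z]+){2,}\b')

for sample in ['TeX', 'WikiWord', 'WikiWordS', 'thisIsEmbedded']:
    print sample, bool(longwinded.match(sample)), bool(concise.match(sample))
###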
With this in mind, here is a modified iswikiword() that should be more
strict:
###
import re

def iswikiword(word):
    regex = re.compile(r'''
        \b                    ## beginning break,
        ([A-Z][a-z]+){2,}     ## at least two capitalized
                              ## words,
        \b                    ## and an ending break.
        ''', re.VERBOSE)
    return regex.match(word)
###
Let's see how it works:
###
>>> iswikiword('TeX')
>>> iswikiword('WikiWord')
<_sre.SRE_Match object at 0x2028e0>
>>> iswikiword('WikiWordS')
>>> iswikiword('thisIsEmbedded?')
>>>
###
Kirk, your main processing loop:
###
for rawline in page:
    line = string.split(rawline, ' ')
    ...
    for word in line:
        if iswikiword(word):
            ...
###
is doing too much work. It's possible to do all the substitutions at once
by using re.sub(), the substitution function. Here's a concrete example:
###
>>> wiki_regex = re.compile(r'\b([A-Z][a-z]+){2,}\b')
>>> def processWikiMatch(match):
...     '''Given a match object of a WikiWord, returns a link to that
...     WikiWord.'''
...     return makewikilink(match.group(0))
...
>>> def makewikilink(word):
...     '''proxy function; replace this with real function'''
...     return '<a href="">%s</a>' % word
...
>>> result = wiki_regex.sub(processWikiMatch,
... '''
... This is a test of feeding a WikiWorded sentence
... into a RegularExpression, using a
... SubstitutionFunction. Cool, huh?''')
>>> print result
This is a test of feeding a <a href="">WikiWorded</a> sentence
into a <a href="">RegularExpression</a>, using a
<a href="">SubstitutionFunction</a>. Cool, huh?
###
re.sub() here can optionally take a "substitution" function, rather than
just a simple string! This is something that isn't as well known as it
deserves to be.
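So Kirk, as a rough sketch (the name linkify() is made up here, and I'm
assuming 'page' is a sequence of raw lines, as in your loop above), the
whole word-splitting loop can collapse into one sub() per line:
###
import re

wiki_regex = re.compile(r'\b([A-Z][a-z]+){2,}\b')

def makewikilink(word):
    '''proxy function; replace this with the real link maker'''
    return '<a href="">%s</a>' % word

def linkify(page):
    '''Takes the raw lines of a page and returns new lines with every
    WikiWord turned into a link, with no word-by-word splitting needed.'''
    def replace(match):
        return makewikilink(match.group(0))
    return [wiki_regex.sub(replace, rawline) for rawline in page]
###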
Hope this helps!