fast regex

Tim Chase python.list at tim.thechases.com
Fri May 7 08:32:33 EDT 2010


[your reply appears to have come only to me instead of the 
mailing list; CC'ing c.l.p in reply]

On 05/06/2010 10:12 PM, James Cai wrote:
> When you say "This does a replacement for every word in the input corpus
> (possibly with itself), but only takes one pass through the source text. "
> It sounds great, but I am kinda lost with your code, sorry I am a regex
> newbie.
>
> calling statement
>
> results = r.sub(replacer, content)
>
> the replacer is a function that needs parameter. How does the replacer know
> what parameter?

The documentation on the .sub() method says that the replacement 
can either be some text (as you used) or something callable (in 
my case a function, but could be an object with a __call__ method 
too), and that this callable is passed the match-object that's found.

> why is r = re.compile(r'\b[a-zA-Z]+\b') when the words i want to find should
> be in the word_list?

You don't detail what sorts of words are in your word_list.keys() 
but you'd want your pattern to match those.  My first regexp 
(that you quote) matches single words, but rather loosely (thus 
my caveat about "replace[s] every word in the input").  The 
replacement function checks to see if the replacement is in your 
word-list, and does the replacement, otherwise, it just returns 
the input.  To watch it in action, you can try this:

   d = { # keys are all lowercase
     'hello': 'goodbye',
     'world': 'Python',
     }

   def replacer(match):
     text = match.group(0)
     replacement = d.get(text.lower(), text)

     # see what we're doing for explanation purposes
     print "Replacing %r with %r" % (text, replacement)

     return replacement

   r = re.compile(r'\b[a-zA-Z]+\b')
   print r.sub(replacer, "Hello there world, this is a test")

> Is the match here the match of regex or just a variable name?

If the keys in your word_list are more than just words, then the 
regexp may not find them all, and thus not replace them all.  In 
that case you may have to resort to my 2nd regexp which builds 
the 5k branch regexp from your actual dictionary keys:

>>  r = re.compile(r'\b(%s)\b' % (
>>    '|'.join(re.escape(s) for s in words_list.keys())
>>    ),
>>    re.IGNORECASE)

This method on the above dictionary (modified)

   d = {
     'hello': 'goodbye',
     'world': 'python',
     'stuff with spaces?': 'tadah!',
     }

would create a regexp of

   \b(hello|world|stuff\ with\ spaces\?)\b

This has considerable performance implications as len(word_list) 
grows, unless you can figure a way to determine that some 
replacements are more probable than others and push them to the 
front of this regexp, but that's more complex and requires 
knowledge of your word-list.

However, if all your keys are simply alpha (or alphanumeric, or 
findable by a simple regexp; likely one that doesn't include 
whitespace), you'll likely get much better performance with a 
generic regexp that over-captures, tries to find a replacement in 
your dict, returning that as the replacement; or if it's not in 
the dict, returning the original text unchanged.  My simple test 
would be:

   test_regex = r'\w+'
   r = re.compile(r'^\b%s\b$' % test_regex)
   # added the "^....$" to anchor for testing purposes

   for key in word_list: # keys by default
     if not r.match(key):
       print "Failed to match %r" % key
       break

If this passes, then the regexp should likely be sufficient to 
capture everything needed to use my replacer() function above.

-tkc





More information about the Python-list mailing list