fast regex

Thu May 6 22:47:18 EDT 2010

On 05/06/2010 09:11 PM, james_027 wrote:
> for key, value in words_list.items():
>      compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
>      search = compile.sub(value, content)
>
> where the content is a large text about 500,000 characters and the
> word list is about 5,000

You don't say what you want to do with "search" vs. "content".
Are you then reassigning

   content = search

so that subsequent replacements build on earlier ones?  (Your
current version creates "search" only to overwrite it on the
next iteration.)
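
If so, I'd guess the loop ends up looking something like the sketch below. The sample words_list and content are made up for illustration. I've also renamed "compile" to "pattern" and added re.escape() on the key, neither of which is in your original.

```python
import re

# Hypothetical sample data, not from the original post.
words_list = {"cat": "feline", "dog": "canine"}
content = "The Cat chased the dog past the catalog."

for key, value in words_list.items():
    # re.escape() guards against keys containing regex metacharacters
    pattern = re.compile(r"\b%s\b" % re.escape(key), re.IGNORECASE)
    # reassign so each pass builds on the previous replacements
    content = pattern.sub(value, content)

print(content)
```

Note that this still scans the whole text once per key, so roughly 5,000 scans of 500,000 characters with your numbers.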

My first thought would be to make use of re.sub()'s ability to 
take a function and do something like

   import re

   # a regexp that finds all possible
   # matches/words of interest
   r = re.compile(r'\b[a-zA-Z]+\b')
   def replacer(match):
     text = match.group(0)
     # assuming your dict.keys() are all lowercase:
     return words_list.get(text.lower(), text)
   results = r.sub(replacer, content)

This does a replacement for every word in the input corpus 
(possibly with itself), but only takes one pass through the 
source text.  If you wanted to get really fancy (and didn't butt 
up against the max size for a regexp), I suppose you could do 
something like

   r = re.compile(r'\b(%s)\b' % (
     '|'.join(re.escape(s) for s in words_list.keys())),
     re.IGNORECASE)
   def replacer(match):
     return words_list[match.group(0).lower()] # assume lower keys
   results = r.sub(replacer, content)

which would only do replacements on your keys rather than every 
"word" in your input, but I'd start with the first version before 
abusing programmatic regexp generation.
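
To compare the two side by side, here is a small self-contained sketch that runs both on the same input. The sample data and the names word_re, keys_re, replace_any_word and replace_key are invented for the example, and it assumes all dictionary keys are lowercase.

```python
import re

# Hypothetical sample data; keys are assumed to be lowercase.
words_list = {"cat": "feline", "dog": "canine"}
content = "The Cat chased the dog past the catalog."

# Version 1: match every alphabetic word and look each one up.
word_re = re.compile(r'\b[a-zA-Z]+\b')

def replace_any_word(match):
    text = match.group(0)
    # words that are not keys are replaced with themselves
    return words_list.get(text.lower(), text)

result_1 = word_re.sub(replace_any_word, content)

# Version 2: a single alternation built from the keys.
keys_re = re.compile(
    r'\b(%s)\b' % '|'.join(re.escape(s) for s in words_list),
    re.IGNORECASE)

def replace_key(match):
    # only keys can match, so a plain lookup is enough
    return words_list[match.group(0).lower()]

result_2 = keys_re.sub(replace_key, content)

print(result_1)
print(result_2)
```

On this input both versions should give the same text, with "catalog" left alone because of the \b word boundaries. Which one is faster on your real data is something you'd want to measure rather than take on faith.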

-tkc