fast regex

Sat May 8 10:26:14 EDT 2010

Bryan wrote:
> Tim Chase wrote:
>> James wrote:
>>> [Tim had written:]
>> If the keys in your word_list are more than just words, then the
>> regexp may not find them all, and thus not replace them all.  In
>> that case you may have to resort to my 2nd regexp which builds
>> the 5k branch regexp from your actual dictionary keys:
>>
>>>>  r = re.compile(r'\b(%s)\b' % (
>>>>    '|'.join(re.escape(s) for s in words_list.keys())
>>>>    ),
>>>>    re.IGNORECASE)
>> This method on the above dictionary (modified)
>>
>>    d = {
>>      'hello': 'goodbye',
>>      'world': 'python',
>>      'stuff with spaces?': 'tadah!',
>>      }
>>
>> would create a regexp of
>>
>>    \b(about|world|stuff\ with\ spaces\?)\b
> 
> 
> There's a subtle issue of the order of the keywords. Pythons
> (ir)regular expressions do not strictly guarantee to find the longest
> match. For example,
> 
> re.findall(r'\b(about|about\-faced)\b',
>         'does about-faced match?')
> 
> returns ['about'], which is the first complete match, not the longest.
> Sorting the keywords into longest-first order avoids the problem.
> 
> 
>> This has considerable performance implications as len(word_list)
>> grows, unless you can figure a way to determine that some
>> replacements are more probable than others and push them to the
>> front of this regexp, but that's more complex and requires
>> knowledge of your word-list.
> 
> There is another method which is to factor out common prefixes, so the
> re module's search-and-backtrack engine never has a choice of multiple
> paths. Instead of:
> 
>     \b(abolished|abolition)\b
> 
> we'd use:
> 
>     \b(aboli(shed|tion)\b
> 
> The RE-generator is of course more complex than Tim's one-liner, but I
> happen to have code, which I'll include below. It's probably academic,
> as I'd agree with Tim that his first solution is the better candidate
> at this point.
> 
> With my giant prefix-combined RE's, I can re.compile one with 4000
> words randomly chosen from a Unix words file, but 5000 results in
> "regular expression code size limit exceeded". Tim's version which
> doesn't combine prefixes tops out a little lower. This is on 32-bit
> Windows, standard distribution. One could, of course, build multiple
> smaller RE's, but that starts to suck.
> 
There's an alternative regex module at:

     http://bugs.python.org/issue2636

and also at:

     http://pypi.python.org/pypi/regex/0.1.2010414

It looks for common prefixes and suffixes internally and can handle much
larger regexes.