[Python-ideas] Matching multiple regex patterns simultaneously

M.-A. Lemburg mal at egenix.com
Tue Mar 2 23:09:34 CET 2010


Andrey Fedorov wrote:
> So a couple of libraries (Django being the most popular that comes to mind)
> try to match a string against several regex expressions. I'm wondering if
> there exists a library to "merge" multiple compiled regex expressions into a
> single lookup. This could be exposed in a interface like:
> 
> http://gist.github.com/319905
> 
> 
> So for an example:
> 
> rd = ReDict()
> 
> rd['^foo$'] = 1
> rd['^bar*$'] = 2
> rd['^bar$'] = 3
> 
> assert rd['foo'] == [1]
> assert rd['barrrr'] == [2]
> assert rd['bar'] == [2,3]
> 
> The naive implementation I link is obviously inefficient. What would be the
> easiest way to go about compiling a set of regex-es together, so that they
> can be matched against a string at the same time? Are there any standard
> libraries that do this I'm not aware of?

This is a rather difficult problem to solve in general.

If you only need to search for a finite set of words, there are few
good algorithms for this:

http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
 * used in Unix fgrep
 * Python implementation:
   http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm#Rabin.E2.80.93Karp_and_multiple_pattern_search
 * uses hashing

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 02 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/



More information about the Python-ideas mailing list