Matching multiple regex patterns simultaneously

So a couple of libraries (Django being the most popular that comes to mind) try to match a string against several regex expressions. I'm wondering if there exists a library to "merge" multiple compiled regex expressions into a single lookup. This could be exposed in a interface like: http://gist.github.com/319905 So for an example: rd = ReDict() rd['^foo$'] = 1 rd['^bar*$'] = 2 rd['^bar$'] = 3 assert rd['foo'] == [1] assert rd['barrrr'] == [2] assert rd['bar'] == [2,3] The naive implementation I link is obviously inefficient. What would be the easiest way to go about compiling a set of regex-es together, so that they can be matched against a string at the same time? Are there any standard libraries that do this I'm not aware of? Cheers, Andrey

Andrey Fedorov wrote:
This is a rather difficult problem to solve in general. If you only need to search for a finite set of words, there are few good algorithms for this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm * used in Unix fgrep * Python implementation: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm#Rabin.E2.80.... * uses hashing -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 02 2010)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/


Stefan Behnel wrote:
Nice :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 04 2010)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 03/02/2010 09:39 PM, Andrey Fedorov wrote:
You can do something like this: r=re.compile('(?P<a>^foo$)|(?P<b>(?P<c>^bar)r*$)')
Ok, it's not 100% the same (it does not match 'ba'), but I think this should cover most cases where you want something like this. Hmm, well. You should resolve it to a form where there are no overlappings in the subexpressions: (?P<a>^foo$)|(?P<b>^ba$)|(?P<c>^bar$)|(?P<d>^bar+$) -panzi

Andrey Fedorov wrote:
The only thing I'm aware of at the moment is my own Plex library, but it's currently implemented in pure Python, so it's not as efficient as the re module with an extension like this would be. I have a half-finished project on the back burner to reimplement the core of Plex using Pyrex, but it's been languishing for so long that the burner has probably gone out by now. Maybe I should relight it... -- Greg

Greg Ewing, 05.03.2010 00:07:
Cython bootstraps the tiny truely performance-critical part of it (Scanners.py) into C code during installation and uses an external Scanners.pxd file to override the Python types in the module with C types. The other modules don't seem to matter in benchmarks, but compiling the inner loop of the scanner clearly gives a performance boost. Stefan

Andrey Fedorov wrote:
This is a rather difficult problem to solve in general. If you only need to search for a finite set of words, there are few good algorithms for this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm * used in Unix fgrep * Python implementation: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm#Rabin.E2.80.... * uses hashing -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 02 2010)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/


Stefan Behnel wrote:
Nice :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 04 2010)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 03/02/2010 09:39 PM, Andrey Fedorov wrote:
You can do something like this: r=re.compile('(?P<a>^foo$)|(?P<b>(?P<c>^bar)r*$)')
Ok, it's not 100% the same (it does not match 'ba'), but I think this should cover most cases where you want something like this. Hmm, well. You should resolve it to a form where there are no overlappings in the subexpressions: (?P<a>^foo$)|(?P<b>^ba$)|(?P<c>^bar$)|(?P<d>^bar+$) -panzi

Andrey Fedorov wrote:
The only thing I'm aware of at the moment is my own Plex library, but it's currently implemented in pure Python, so it's not as efficient as the re module with an extension like this would be. I have a half-finished project on the back burner to reimplement the core of Plex using Pyrex, but it's been languishing for so long that the burner has probably gone out by now. Maybe I should relight it... -- Greg

Greg Ewing, 05.03.2010 00:07:
Cython bootstraps the tiny truely performance-critical part of it (Scanners.py) into C code during installation and uses an external Scanners.pxd file to override the Python types in the module with C types. The other modules don't seem to matter in benchmarks, but compiling the inner loop of the scanner clearly gives a performance boost. Stefan
participants (5)
-
Andrey Fedorov
-
Greg Ewing
-
M.-A. Lemburg
-
Mathias Panzenböck
-
Stefan Behnel