Seeking regex optimizer

andrewdalke at gmail.com andrewdalke at gmail.com
Mon Jun 19 22:51:34 CEST 2006


Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a SRE_Match
> object when s starts with an s_i for one i in [0,...,n].

Why do you want to use a regex for this?  When you have constant
strings there are other solutions.  For example, this uses Nicolas
Lehuen's
pytst module from http://nicolas.lehuen.com/index.php/Pytst

>>> import tst
>>> tree = tst.TST()
>>> tree["aa"] = (1, "String with 'aa'")
>>> tree["aahhh"] = (2, "At the doctor's")
>>> tree["a"] = (3, "indefinite article")
>>> tree.scan("This is a bunch of text.  Have you seen an aardvark?",
...   tst.TupleListAction())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')), (' bunch
of text.  H', -18, None), ('a', 1, (3, 'indefinite article')), ('ve you
seen ', -12, None), ('a', 1, (3, 'indefinite article')), ('n ', -2,
None), ('aa', 2, (1, "String with 'aa'")), ('rdv', -3, None), ('a', 1,
(3, 'indefinite article')), ('rk?', -3, None)]
>>>

Each returned tuple has three elements:
  For a substring which is not a match these are:
   - the unmatched substring
   - the negative length of the substring
   - None

  For a substring which matches:
   - the matched substring
   - the positive length of the substring
   - the value of the match in the TST (the tree)

It uses Aho-Corasick for the implementation which is fast and does what
you expect it to do.  Nor does it have a problem of matching more than
99 possible strings as the regexp approach may have.

                                Andrew
                                dalke at dalkescientific.com




More information about the Python-list mailing list