Seeking regex optimizer
andrewdalke at gmail.com
andrewdalke at gmail.com
Mon Jun 19 16:51:34 EDT 2006
Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a SRE_Match
> object when s starts with an s_i for one i in [0,...,n].
Why do you want to use a regex for this? When you have constant
strings there are other solutions. For example, this uses Nicolas
Lehuen's
pytst module from http://nicolas.lehuen.com/index.php/Pytst
>>> import tst
>>> tree = tst.TST()
>>> tree["aa"] = (1, "String with 'aa'")
>>> tree["aahhh"] = (2, "At the doctor's")
>>> tree["a"] = (3, "indefinite article")
>>> tree.scan("This is a bunch of text. Have you seen an aardvark?",
... tst.TupleListAction())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')), (' bunch
of text. H', -18, None), ('a', 1, (3, 'indefinite article')), ('ve you
seen ', -12, None), ('a', 1, (3, 'indefinite article')), ('n ', -2,
None), ('aa', 2, (1, "String with 'aa'")), ('rdv', -3, None), ('a', 1,
(3, 'indefinite article')), ('rk?', -3, None)]
>>>
Each returned tuple has three elements:
For a substring which is not a match these are:
- the unmatched substring
- the negative length of the substring
- None
For a substring which matches:
- the matched substring
- the positive length of the substring
- the value of the match in the TST (the tree)
It uses Aho-Corasick for the implementation which is fast and does what
you expect it to do. Nor does it have a problem of matching more than
99 possible strings as the regexp approach may have.
Andrew
dalke at dalkescientific.com
More information about the Python-list
mailing list