Seeking regex optimizer

Sun Jun 18 17:09:33 EDT 2006

Paddy wrote:
> Kay Schluehr wrote:
> > I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> > regular expression sx from it, such that sx.match(s) yields a SRE_Match
> > object when s starts with an s_i for one i in [0,...,n].  There might
> > be relations between those strings: s_k.startswith(s_1) -> True or
> > s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
> > ...,'aaaa...ab']. For this reason SRE_Match should provide the longest
> > possible match.
> >
> > Is there a Python module able to create an optimized regex rx from ls
> > for the given constraints?
> >
> > Regards,
> > Kay
>
> A start would be:
>   regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.

For special characters there might be a workaround using escapes. This
is indeed important, but I do think one should split the problem into
separate parts.

> You say you want something that is optimised. What have have you tried?

Sorting the list and checking for successor inclusions. Say you have ls
= ['x','a', 'aa', 'aab' ,'ab']

This can be mapped to:

'x|a(?:(?:ab)?|b?|a?)'

or to:

'^(x|ab|aab|aa|a)'

with reverse sorting as in your proposal.The naive solution is easy to
generate but I'm sceptical about its cost effectiveness. On the other
hand I do not want to investigate this matter if somebody else already
did it thoroughly.

Regards,
Kay