# Enumerating Regular Expressions

Tim Chase python.list at tim.thechases.com
Tue May 9 14:07:16 CEST 2006

> Why are people getting stuck on infinite regular
> languages?  I've made it quite clear that I'm only really
>  interested in doing this for finite languages, but that
> shouldn't matter anyway.

The power of regular expressions is that they define a
consice means to encapsulate an infinite number of inputs.
If you're not tapping this power, the general wisdom is to
not use regular expressions.

That said, unless you can ensure you don't have an infinite
grammer (in particular, no asterisks or plus-signs or
unbounded ranges like "{4,}"), you then have the problem of
how to navigate that infinite set.

Taking for example the most common/simple regexp:

".*"

That matches any and every string.  There's a whole library
of Congress for which every written work will match that
string.  It sounds a lot like Jorge Luis Borges's 1956 "The
Library of Babel".

Do you begin searching depth-first with "a" and then "aa"
and then "aaa"?  Or do you begin breadth-first with "a",
then "b"...then "z", then "aa", "ab", ...?  Depth-first is
far more efficient when searching a branching set of data,
but when you have infinite depth, it will take a while ;)
Alternatively, you can use some sort of iterative-deepening
search which puts an increasing bound on how deep the
depth-first search will go before giving up for that
particular iteration.  Googling on "iteritave deepening"
should bring back some ideas for that one.

If you truely have a finite grammer, it would be feasible to
write some sort of parser, ugly as it might be.  As a first
pass, I'd have my parser emit python code to write N nested
loops to try each combination of each element, so something like

"[a-zA-Z_][a-zA-Z0-9_]"

became something like

def tries():
s1 = set(range(ord('a'),ord('z')+1))
s2 = set(range(ord('a'),ord('z')+1))
for a1 in s1:
for a2 in s2:
yield "%s%s" % (chr(a1), chr(a2))

Thus, the input regexp would generate python code that you
could then include in your project.  Ugly, but a possible
solution.  There are lots of peculiarities though that would
need to be handled.  Branching, escaped characters in
character sets (well, escaping in general is a headache).

An alternative might be brute-forcing your regexp:

for a1 in range(0,256):
for a2 in range(0,256):
s = "%s%s" % (chr(a1),chr(a2))
m = re.match(s)
if m:
yield s

nested to however deep you're interested in.

All will be slow.  All are pretty ugly.

-tkc