Generating all permutations from a regexp

Mon Dec 25 19:12:15 EST 2006

On 23 Dec 2006 04:23:09 -0800, Chris Johnson <effigies at gmail.com> wrote:
>
> BJörn Lindqvist wrote:
> > With regexps you can search for strings matching it. For example,
> > given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
> > do the reverse, from a regexp generate all strings that could match
> > it.
> >
> > The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
> > "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".
> >
> > Is this possible to do? Obviously, for some regexps the set of matches
> > is unbounded (a list of everything that matches "*" would be very
> > unpractical), but how would you do it for simple regexps like the one
> > above?
>
> For a very small number of characters, it would be feasible. For any
> finite number of characters, it would be possible (though it wouldn't
> take much to take longer than the age of the universe). For reference,
> in your simple example, you have 17,576,000 matching strings.
>
> I'm curious as to why you would wish to do this. I certainly understand
> considering hard problems for their own sake, but when I formulate
> them, there's always some impetus that makes me say "Huh. Now I
> wonder..."

I have a thousand use cases in mind. For example:

1. Generate sentences for a bot:

ipermute("(I|You|He|She|It|They) do( not)? (dis)?like (you|him|her|me|they)"):

Generates many grammatically incorrect sentences but you get the point.

2. Generate URL:s to web resources:

The following should generate URL:s to all c.l.p digests from the mail archive:

def download_clp():
    year_re = "(199\d|200[0-6])"
    months = ["January",
              "February",
              "March",
              "April",
              "May",
              "June",
              "July",
              "August",
              "September",
              "October",
              "November",
              "December"]
    month_re = '(' + '|'.join(months) + ')'
    fmt = "http://mail\.python\.org/pipermail/python-list/%s-%s\.txt"
    url_re = fmt % (year_re, month_re)
    for x in ipermute(url_re):
        print "Downloading", x
        code to download here

The same approach could be used to download all threads in a forum for
example, or all articles on Slashdot.

3. "Visualising" regular expressions. I think it would be easier to
write regular expressions if you could see what kind of data they
would match.

4. Port scanning:

ip_tuple = "(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
for x in ipermute("192\.168\." + "\.".join([ip_tuple] * 2)):
    scan_ip(x)

-- 
mvh Björn