Generating all permutations from a regexp
BJörn Lindqvist
bjourne at gmail.com
Mon Dec 25 19:12:15 EST 2006
On 23 Dec 2006 04:23:09 -0800, Chris Johnson <effigies at gmail.com> wrote:
>
> BJörn Lindqvist wrote:
> > With regexps you can search for strings matching it. For example,
> > given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
> > do the reverse, from a regexp generate all strings that could match
> > it.
> >
> > The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
> > "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".
> >
> > Is this possible to do? Obviously, for some regexps the set of matches
> > is unbounded (a list of everything that matches "*" would be very
> > unpractical), but how would you do it for simple regexps like the one
> > above?
>
> For a very small number of characters, it would be feasible. For any
> finite number of characters, it would be possible (though it wouldn't
> take much to take longer than the age of the universe). For reference,
> in your simple example, you have 17,576,000 matching strings.
>
> I'm curious as to why you would wish to do this. I certainly understand
> considering hard problems for their own sake, but when I formulate
> them, there's always some impetus that makes me say "Huh. Now I
> wonder..."
I have a thousand use cases in mind. For example:
1. Generate sentences for a bot:
ipermute("(I|You|He|She|It|They) do( not)? (dis)?like (you|him|her|me|they)"):
Generates many grammatically incorrect sentences but you get the point.
2. Generate URL:s to web resources:
The following should generate URL:s to all c.l.p digests from the mail archive:
def download_clp():
year_re = "(199\d|200[0-6])"
months = ["January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"]
month_re = '(' + '|'.join(months) + ')'
fmt = "http://mail\.python\.org/pipermail/python-list/%s-%s\.txt"
url_re = fmt % (year_re, month_re)
for x in ipermute(url_re):
print "Downloading", x
code to download here
The same approach could be used to download all threads in a forum for
example, or all articles on Slashdot.
3. "Visualising" regular expressions. I think it would be easier to
write regular expressions if you could see what kind of data they
would match.
4. Port scanning:
ip_tuple = "(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
for x in ipermute("192\.168\." + "\.".join([ip_tuple] * 2)):
scan_ip(x)
--
mvh Björn
More information about the Python-list
mailing list