Generating text from a regular expression

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu Apr 1 05:17:19 EDT 2010


En Wed, 31 Mar 2010 12:23:48 -0300, Paul McGuire <ptmcg at austin.rr.com>
escribió:
> On Mar 31, 5:49 am, Nathan Harmston <iwanttobeabad... at googlemail.com>
> wrote:
>>
>> I have a slightly complicated/medium sized regular expression and I
>> want to generate all possible words that it can match (to compare
>> performance of regex against an acora based matcher).
>
> The pyparsing wiki Examples page includes this regex inverter:
> http://pyparsing.wikispaces.com/file/view/invRegex.py
>
>> From the module header:
> # Supports:
> # - {n} and {m,n} repetition, but not unbounded + or * repetition
> # - ? optional elements
> # - [] character ranges
> # - () grouping
> # - | alternation

I took the liberty of modifying your invRegex.py example, adding support
for infinite repeaters. It depends on two other modules:

mergeinf.py (from http://code.activestate.com/recipes/577041) provides the
infinite merge operation.

enumre.py provides the basic functions (merge, prod, repeat, closure)
necessary to enumerate the language generated by a given regular
expression, even if it contains unbounded repeaters like *,+.  The key is
to generate shorter strings before longer ones, so in 'a*|b*' it doesn't
get stuck generating infinite a's before any b.

By example, "(a|bc)*d" corresponds to this code:

       prod(
         closure(
           merge(
             'a',
              prod('b','c'))),
         'd')

which returns an infinite generator starting with:

d
ad
aad
bcd
aaad
abcd
bcad
aaaad
aabcd
abcad
bcaad
bcbcd
aaaaad
aaabcd
aabcad
...


I got the idea from
http://userweb.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf

Finally, invRegexInf.py is based on your original regex parser. I only
modified the generation part, taking advantage of the above
infrastructure; the parser itself remains almost the same. It essentially
saves oneself the very tedious work of converting a regular expression
into the equivalent sequence of function calls as shown above. (I hope I
got it right: I like pyparsing a lot and use it whenever I feel it's
appropriate, but not as often as to remember the details...)

-- 
Gabriel Genellina
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invRegexInf.py
Type: application/octet-stream
Size: 7691 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20100401/40230735/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: enumre.py
Type: application/octet-stream
Size: 10669 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20100401/40230735/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mergeinf.py
Type: application/octet-stream
Size: 4546 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20100401/40230735/attachment-0002.obj>


More information about the Python-list mailing list