[Tutor] Using Regex to produce text

Steven D'Aprano steve at pearwood.info
Thu Apr 29 01:16:56 CEST 2010


On Thu, 29 Apr 2010 06:36:18 am Lie Ryan wrote:
> On 04/29/10 01:32, mhw at doctors.net.uk wrote:
> > While some patterns are infinite, others aren't (e.g. the example
> > I gave).
>
> How should the regex engine know about that?

The regex engine itself doesn't run in reverse, so it can't know this 
and doesn't need to. However, it is possible to write a regex engine 
that does run in reverse, generating strings instead of matching them, 
in which case it is up to the programmer who creates it to encode that 
knowledge in the engine.
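
As a rough illustration of encoding that knowledge, here is a minimal 
sketch of a generator for the single pattern a*. The name gen_star is 
made up for this example; nothing like it exists in the re module. The 
pattern is infinite, so the generator is too, and it is the caller who 
decides how many strings to take. It assumes Python 3.4 or later for 
re.fullmatch.

```python
import itertools
import re

def gen_star(char):
    """Yield the empty string, then one copy of char, then two, forever."""
    for n in itertools.count():
        yield char * n

# Take only the first five strings from the infinite stream.
first_five = list(itertools.islice(gen_star('a'), 5))

# Each generated string is matched by the pattern it came from.
assert all(re.fullmatch('a*', s) for s in first_five)
```

A finite pattern would simply be a generator that stops on its own.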


> > Using a subset of Regex syntax to produce a set of strings has the
> > advantage of using a well understood and documented form, and if
> > you could hook into the existing API, at minimal coding effort.
> >
> > In addition, it allows a nice symmetry between search and
> > production of resource names.
>
> String generation is generally simpler than string parsing. If the
> pattern of the string you're generating is so complex that you need a
> regex-powered name generator, it will probably be impossible to parse
> that. 

What? That makes no sense. That's like saying "I have here a formula for 
generating a series of numbers which is so complicated that it is 
impossible to write a formula for it". Since the string was generated 
from a regex, it will be parsable by *exactly* the same regex.
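
A small sketch of that round trip, under loud assumptions: the pattern 
below is invented for illustration (it is not the original poster's), 
and PARTS is a hand-written stand-in for what a reverse engine would 
derive from the pattern by itself. gen_strings is likewise a 
hypothetical helper, not part of re.

```python
import itertools
import re

def gen_strings(parts):
    """Yield every string made by picking one alternative per position.

    `parts` is a list; each item is the list of strings allowed at
    that position of the pattern.
    """
    for combo in itertools.product(*parts):
        yield ''.join(combo)

# Hypothetical pattern and its hand-translated form.
PATTERN = r'(foo|bar)_[0-2]\.txt'
PARTS = [['foo', 'bar'], ['_'], ['0', '1', '2'], ['.txt']]

names = list(gen_strings(PARTS))

# Everything generated from the pattern is parsed by the same pattern.
assert all(re.fullmatch(PATTERN, name) for name in names)
```

Because this pattern has no unbounded repeats, the set of names is 
finite: two prefixes times three digits.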


> Use string interpolation/formatting instead: '%s_%0s.txt' % 
> (name, num)

All this does is delay the work. You still have to generate all possible 
names and nums. Since you haven't defined what they are meant to be, 
it's impossible to do so.
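
To see where the work ends up, here is a sketch that uses string 
formatting anyway. NAMES and NUMS are placeholder values invented for 
this example, since the thread never says what they should be, and the 
'%02d' format is likewise an assumption rather than the format quoted 
above.

```python
import itertools

# Placeholder value sets: somebody still has to define these.
NAMES = ['alpha', 'beta']
NUMS = range(3)

def gen_filenames(names, nums):
    """Yield one formatted filename per (name, num) combination."""
    for name, num in itertools.product(names, nums):
        yield '%s_%02d.txt' % (name, num)

filenames = list(gen_filenames(NAMES, NUMS))
```

The formatting is one line; enumerating the names and nums is the part 
that a pattern would otherwise have described.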


> > I suspect it's not that easy, as I don't think we can get to the
> > internals of the regex FSM. However, I thought it would be worth
> > asking.
>
> The problem is how you would define the "universe" set of characters.

The same way the regex engine does.


> If you had a '.', would you want alphanumeric only, all printable
> characters, all ASCII (0-127) characters, all byte (0-255) characters,
> all Unicode characters? 

The regex engine defines . as meaning "any character except newline, or 
any character including newline if the dotall flag is given". The regex 
engine operates on byte strings unless you give it the unicode flag. 
Given that the original poster wants to stick to regex syntax rather 
than learn a different syntax with different definitions, the 
universal set of characters is well defined.

Here is a generator which should do the job:

# Untested.
def gen_dot(dotall_flag=False, unicode_flag=False):
    """Iterate over the sequence of strings which matches ."""
    if not unicode_flag:
        all_chars = [chr(i) for i in range(256)]
        if not dotall_flag:
            all_chars.remove('\n')
        for c in all_chars:
            yield c
    else:
        # There are a *lot* of unicode characters, but I don't know
        # how many. Take the coward's way out.
        raise NotImplementedError('left as an exercise for the reader')
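
For the unicode case, one possible sketch, assuming Python 3, where 
every str is Unicode and sys.maxunicode gives the highest code point 
(0x10FFFF). gen_dot_unicode is a made-up name, and this version also 
yields lone surrogates, which a real application may well want to 
skip.

```python
import itertools
import re
import sys

def gen_dot_unicode(dotall_flag=False):
    """Iterate over every one-character string that . can match."""
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        # Without the dotall flag, . matches anything except newline.
        if c == '\n' and not dotall_flag:
            continue
        yield c

# Spot-check the first 300 characters against the real engine.
sample = list(itertools.islice(gen_dot_unicode(), 300))
assert all(re.fullmatch('.', c) for c in sample)
```

That is over a million characters, so it only makes sense to consume 
it lazily.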




> It's too ambiguous, and if you say to follow 
> what regex is doing, then regex just happens not to choose the
> most convenient default for pattern generators.

Whether regex rules are the most convenient, or whether learning yet 
another complicated, terse language is better, is not the question.



-- 
Steven D'Aprano

