Regular expression help
Bengt Richter
bokr at oz.net
Thu Jul 17 17:15:00 EDT 2003
On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fredrik at pythonware.com> wrote:
>David Lees wrote:
>
>> I forget how to find multiple instances of stuff between tags using
>> regular expressions. Specifically I want to find all the text between a
>> series of begin/end pairs in a multiline file.
>>
>> I tried:
>> >>> p = 'begin(.*)end'
>> >>> m = re.search(p,s,re.DOTALL)
>>
>> and got everything between the first begin and last end. I guess
>> because of a greedy match. What I want to do is a list where each
>> element is the text between another begin/end pair.
>
>people will tell you to use non-greedy matches, but that's often a
>bad idea in cases like this: the RE engine has to store lots of
>backtracking information, and your program will consume a lot more
>memory than it has to (and may run out of stack and/or memory).

Would you say so for this case? Or for cases how much like this one?

For the above case, wouldn't the regex compile to a state machine
that just has a few states: recognize 'e' while scanning .*, revert to .*
if the next character is not 'n', and if it is, look for 'd' similarly,
reverting to .* on failure or finishing on success? For a short
terminating match, that would seem relatively cheap.
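For reference, here is a minimal sketch of the greedy and non-greedy patterns under discussion. It is written for current Python 3 rather than the interpreter used in this thread, and the sample text is an assumption that mirrors the begin/end example used later in this message. It only shows what each pattern matches; it says nothing about how much backtracking state the engine keeps.

```python
import re

# Sample input (assumed for illustration): two begin/end pairs.
text = """begin s1 end
sdfsdf
begin s2 end
"""

# Greedy: .* runs to the last "end", so the whole middle is one match.
greedy = re.findall(r"begin(.*)end", text, re.DOTALL)

# Non-greedy: .*? stops at the first "end" after each "begin".
non_greedy = re.findall(r"begin(.*?)end", text, re.DOTALL)

print(greedy)      # one element spanning both pairs
print(non_greedy)  # one element per begin/end pair
```

The non-greedy form gives the list the original question asked for; whether its memory cost matters for a given input is the point being debated above.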
>at this point, it's also obvious that you don't really have to use
>regular expressions:
>
>    pos = 0
>
>    while 1:
>        start = text.find("begin", pos)
>        if start < 0:
>            break
>        start += 5
>        end = text.find("end", start)
>        if end < 0:
>            break
>        process(text[start:end])
>        pos = end # move forward
>
></F>
Or breaking your loop with an exception instead of tests:
>>> text = """begin s1 end
... sdfsdf
... begin s2 end
... """
>>> def process(s): print 'processing(%r)'%s
...
>>> try:
...     end = 0 # end of previous search
...     while 1:
...         start = text.index("begin", end) + 5
...         end = text.index("end", start)
...         process(text[start:end])
... except ValueError:
...     pass
...
processing(' s1 ')
processing(' s2 ')
Or if you're guaranteed that every begin has an end, you could also write
>>> for begxxx in text.split('begin')[1:]:
...     process(begxxx.split('end')[0])
...
processing(' s1 ')
processing(' s2 ')
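The "every begin has an end" guarantee matters for the split version. A small sketch, assuming Python 3 and two hypothetical helper names (between_split, between_index) that wrap the split and index approaches, shows how they differ when the last end is missing:

```python
def between_split(text):
    """Split-based extraction; assumes every 'begin' has a matching 'end'."""
    return [chunk.split("end")[0] for chunk in text.split("begin")[1:]]


def between_index(text):
    """Index-based extraction; stops at the first 'begin' without an 'end'."""
    found = []
    end = 0  # end of previous search
    try:
        while True:
            start = text.index("begin", end) + len("begin")
            end = text.index("end", start)
            found.append(text[start:end])
    except ValueError:
        # Raised by str.index when no further "begin" or "end" is found.
        pass
    return found


ok = "begin s1 end\nsdfsdf\nbegin s2 end\n"
broken = "begin s1 end\nsdfsdf\nbegin s2 \n"  # second "end" is missing

print(between_split(ok), between_index(ok))
print(between_split(broken), between_index(broken))
```

With the well-formed input both helpers return the same two pieces. With the truncated input the split version still returns a second piece (everything after the unterminated begin), while the index version drops it.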
Regards,
Bengt Richter