Regular expression help

Fri Jul 18 01:20:37 EDT 2003

Bengt Richter wrote:

> On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <abcdebl2nonspammy at verizon.net> wrote:
> 
> 
>>Andrew Bennetts wrote:
>>
>>>On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
>>>
>>>
>>>>I forget how to find multiple instances of stuff between tags using 
>>>>regular expressions.  Specifically I want to find all the text between a 
>>>
>>>                                               ^^^^^^^^
>>>
>>>How about re.findall?
>>>
>>>E.g.:
>>>
>>>    >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END   BEGIN bar END') 
>>>    [' foo ', ' bar ']
>>>
>>>-Andrew.
>>>
>>>
>>
>>Actually this fails with the multi-line type of file I was asking about.
>>
>>
>>    >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END   BEGIN bar END')
>>    [' bar ']
>>
> 
> It works if you include the DOTALL flag (?s) at the beginning, which makes
> . also match \n: (BTW, (?si) would make it case-insensitive).
> 
>  >>> import re
>  >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END   BEGIN bar END')
>  [' foo\nmumble ', ' bar ']
> 
> Regards,
> Bengt Richter
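
A minimal sketch of the point quoted above, assuming a current Python 3 interpreter rather than the version in use when this thread was written. The inline `(?s)` flag and the `re.DOTALL` argument are two spellings of the same option; without it, `.` stops at the newline, so the first block is skipped.

```python
import re

text = 'BEGIN foo\nmumble END   BEGIN bar END'

# Without DOTALL, '.' does not match '\n', so the first block cannot match.
without_flag = re.findall(r'BEGIN(.*?)END', text)

# Inline flag, as in the quoted reply.
inline_flag = re.findall(r'(?s)BEGIN(.*?)END', text)

# Same option passed as an argument instead of inside the pattern.
flag_argument = re.findall(r'BEGIN(.*?)END', text, re.DOTALL)

print(without_flag)
print(inline_flag)
print(flag_argument)
```
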

I just benchmarked both of Fredrik's suggestions along with Bengt's, 
using the same input file.  The results, in seconds (looping 200 times 
over the 400k file), are:
Fredrik, regex =  1.74003930667
Fredrik, no regex =  0.434207978947
Bengt, regex =  1.45420158149

Interesting how much faster the non-regex approach is: roughly three to 
four times faster than either regex version on this input.

Thanks again.

David Lees

The code (which I have not carefully checked) is:

import re, time

def timeBengt(s,N):
     p = 'begin msc(.*?)end msc'
     rx = re.compile(p, re.DOTALL)  # DOTALL lets . match newlines too
     t0 = time.clock()
     for i in xrange(N):
         x = rx.findall(s)
     t1 = time.clock()
     return t1-t0

def timeFredrik1(text,N):
     t0 = time.clock()
     for i in xrange(N):
         pos = 0

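         # Note: these match bare "begin"/"end", not "begin msc"/"end msc"
         # as the other two functions do, and they are compiled inside
         # the timed loop.
         START = re.compile("begin")
         END = re.compile("end")

         while 1:
             m = START.search(text, pos)
             if not m:
                 break
             start = m.end()
             m = END.search(text, start)
             if not m:
                 break
             end = m.start()
             pass  # text[start:end] would be processed here
             pos = m.end() # move forward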
     t1 = time.clock()
     return t1-t0

def timeFredrik(text,N):
     t0 = time.clock()
     for i in xrange(N):
         pos = 0
         while 1:
             start = text.find("begin msc", pos)
             if start < 0:
                 break
             start += 9  # len("begin msc")
             end = text.find("end msc", start)
             if end < 0:
                 break
             pass  # text[start:end] would be processed here
             pos = end # move forward

     t1 = time.clock()
     return t1-t0

fh = open('scu.cfg','rb')
s = fh.read()
fh.close()

N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N)
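
For anyone rerunning this on a current interpreter, here is a rough Python 3 sketch of the same comparison. It is not the code that produced the numbers above. It uses `time.perf_counter` instead of `time.clock`, and both variants collect their results so the two can be checked against each other. The helper names and the synthetic `sample` string are hypothetical stand-ins; the original `scu.cfg` is not reproduced here.

```python
import re
import time

START_TAG = 'begin msc'
END_TAG = 'end msc'

# Compiled once, outside any timed region.
PATTERN = re.compile(re.escape(START_TAG) + '(.*?)' + re.escape(END_TAG),
                     re.DOTALL)


def extract_with_regex(text):
    """Return every span between the tags using the compiled pattern."""
    return PATTERN.findall(text)


def extract_with_find(text):
    """Return the same spans using str.find only."""
    results = []
    pos = 0
    while True:
        start = text.find(START_TAG, pos)
        if start < 0:
            break
        start += len(START_TAG)
        end = text.find(END_TAG, start)
        if end < 0:
            break
        results.append(text[start:end])
        pos = end + len(END_TAG)  # resume after the closing tag
    return results


def time_calls(func, text, n):
    """Time n calls of func(text) with a high-resolution clock."""
    t0 = time.perf_counter()
    for _ in range(n):
        func(text)
    return time.perf_counter() - t0


# Synthetic stand-in for the input file (hypothetical data).
sample = ('header\nbegin msc\nalpha\nend msc\nfiller\n'
          'begin msc beta end msc\n') * 1000

# Both approaches should agree before their timings are compared.
assert extract_with_regex(sample) == extract_with_find(sample)

N = 20
print('regex    =', time_calls(extract_with_regex, sample, N))
print('no regex =', time_calls(extract_with_find, sample, N))
```

The timings will depend on the input and the interpreter, so the ratio seen here need not match the figures reported above.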