[Tutor] re module fails to handle text > 16142 characters ???

pan@uchicago.edu pan@uchicago.edu
Wed May 21 00:31:01 2003


> Since this is the Tutor list, how about showing us the actual regexp you
> used?  Then we can try to rewrite it in such a way that you won't bump into
> the recursion limit.  This could be easy, hard, or impossible, but in any
> case should be educational <wink>.

I was trying to parse an 'archive message' such as those messages
in the Tutor list, in an attempt to write an "archive reader". The
message body is wrapped between two comment markers:


some header ...

<!--beginarticle-->

  ... message body here ...
 
<!--endarticle-->

some footer ...


It turns out to be a fairly universal format, used in many
different list archives.

My idea was to take that message body and present it in a customized
way. The following is the function that I used to grab it:

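import re

def getSection(doc, startTag, endTag,
               mustHave='',
               ignoreCase=1,
               returnTags=0):
    # Return the first section of doc found between startTag and endTag.
    # mustHave: text (a regex) that has to appear inside the section.
    # returnTags: if true, the tags are included in the result.
    # Note: the tags go into the pattern as-is, so any regex
    # metacharacters in them are not escaped.
    if ignoreCase:
        flag = re.DOTALL | re.IGNORECASE
    else:
        flag = re.DOTALL

    if returnTags:
        ptn = '(%s.*?%s.*?%s)' % (startTag, mustHave, endTag)
    else:
        ptn = '%s(.*?%s.*?)%s' % (startTag, mustHave, endTag)

    reObj = re.compile(ptn, flag)
    return reObj.findall(doc)[0]   # IndexError if nothing matches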


So if I load a message from an archive site, save it to 'msg', and
use:
   
msgBody = getSection(msg, '<!--beginarticle-->', '<!--endarticle-->')

that should give me the message body.


As I mentioned earlier, this way of parsing the text raises a maximum
recursion error whenever len(msg) is > 16142.

After identifying the source of the error with help from you guys, I
changed the code to:

msgBody = msg.split('<!--beginarticle-->')[1].split('<!--endarticle-->')[0]

which is working nicely.
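
One caveat with the one-liner above: if the begin marker is missing, the
[1] index raises an IndexError, and if the end marker is missing, the rest
of the page is returned silently. Here is a minimal sketch of the same idea
using str.find, so that a missing marker gives None instead. The name
get_body and the choice of returning None are just my assumptions for
illustration, nothing standard:

```python
BEGIN = '<!--beginarticle-->'
END = '<!--endarticle-->'

def get_body(msg, begin=BEGIN, end=END):
    """Return the text between the first `begin` and the next `end`.

    Returns None if either marker is missing (an assumed convention).
    """
    start = msg.find(begin)
    if start == -1:
        return None              # no begin marker
    start += len(begin)          # skip past the marker itself
    stop = msg.find(end, start)  # look for the end marker after it
    if stop == -1:
        return None              # no end marker
    return msg[start:stop]

sample = 'header\n<!--beginarticle-->\nhello\n<!--endarticle-->\nfooter'
print(repr(get_body(sample)))
```

Since this only does plain substring searches, no regex is involved, so the
length of msg should not matter.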

As for 'how to modify the regex approach to avoid the max recursion error',
I really have no idea....

pan