[Tutor] re module fails to handle text > 16142 characters ???
pan@uchicago.edu
Wed May 21 00:31:01 2003
> Since this is the Tutor list, how about showing us the actual regexp you
> used? Then we can try to rewrite it in such a way that you won't bump into
> the recursion limit. This could be easy, hard, or impossible, but in any
> case should be educational <wink>.
I was trying to parse an 'archive message', such as the messages
in the Tutor list archive, in an attempt to write an "archive reader".
The message body is wrapped like this:
some header ...
<!--beginarticle-->
... message body here ...
<!--endarticle-->
some footer ...
It turns out to be a somewhat universal format that is used in many
different list archives.
My idea was to take that message body and present it in a customized
way. The following is the function that I used to grab it:
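import re

def getSection(doc, startTag, endTag,
               mustHave='',
               ignoreCase=1,
               returnTags=0):
    # DOTALL lets '.' match newlines, so the body can span many lines.
    if ignoreCase:
        flag = re.DOTALL | re.IGNORECASE
    else:
        flag = re.DOTALL
    # Capture the tags too only when returnTags is set.
    if returnTags:
        ptn = '(%s.*?%s.*?%s)' % (startTag, mustHave, endTag)
    else:
        ptn = '%s(.*?%s.*?)%s' % (startTag, mustHave, endTag)
    reObj = re.compile(ptn, flag)
    # First match only; raises IndexError if nothing matches.
    return reObj.findall(doc)[0]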
So if I load a message from an archive site, save it to 'msg', and
use:
msgBody = getSection(msg, '<!--beginarticle-->', '<!--endarticle-->')
that should give me the message body.
As I mentioned earlier, this way of parsing the text raised a maximum
recursion error when len(msg) is > 16142.
After identifying the source of the error with help from you guys, I
changed the code to:
msgBody = msg.split('<!--beginarticle-->')[1].split('<!--endarticle-->')[0]
which is working nicely.
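The same idea can also be written with str.find, as a rough sketch
only. The name get_between is made up here for illustration, and
returning None when a tag is missing is my own assumption about what
would be convenient (the split version raises IndexError instead):

```python
def get_between(doc, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag.

    Hypothetical helper: returns None if either tag is not found.
    """
    start = doc.find(start_tag)
    if start == -1:
        return None
    # Skip past the start tag itself.
    start += len(start_tag)
    # Look for the end tag only after the start tag.
    end = doc.find(end_tag, start)
    if end == -1:
        return None
    return doc[start:end]


# Example with a body much longer than 16142 characters.
msg = 'header<!--beginarticle-->' + 'x' * 50000 + '<!--endarticle-->footer'
body = get_between(msg, '<!--beginarticle-->', '<!--endarticle-->')
```

Since no regex is involved, the length of the message should not
matter for the recursion limit.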
As for 'how to modify the regex approach to avoid the max recursion error',
I really have no idea....
pan