Regex recursion error example.

Bengt Richter bokr at oz.net
Fri Nov 1 18:59:39 EST 2002


On 1 Nov 2002 07:13:20 -0800, yin_12180 at yahoo.com (Yin) wrote:

>After tinkering with this issue for a day or so, I've decided to use
>xmllib to solve the problem.  But for future reference, I've attached
>the piece of text that is failing and the two approaches that I've
>tried to make the match.
>
>Of course there are other approaches to doing this parse, but I am
>interested in understanding the regex approach I am trying and its
>limitations.
>
>If there are no solutions using regex, I would be interested in seeing
>a reference to articles or books that discuss overcoming particularly
>long string matches.
>
>Approach 1:
>pattern=re.compile('<PubMedArticle>(.*?)</PubMedArticle>',
>re.DOTALL)
>self.citationlist = re.findall(pattern, allinput)
>
>Approach 2:
>comppat=re.compile(r'<PubMedArticle>((?:(?!<PubMedArticle>).)*)</PubMedArticle>',
>re.DOTALL)
>self.citationlist = re.findall(comppat, allinput)
>
>There are three matches to make in this body of text.  The above code
>has been failing on the second of the three.  This problem has only
>been occurring on linux python and not Windows python (the stack in Windows
>is just large enough to accommodate the matches).
>Text to match:
>
>http://160.129.203.97/1998_xmltest.html
>
Here's a little different approach you could try:

 >>> import re
 >>> import urllib
 >>> allinput = urllib.urlopen('http://160.129.203.97/1998_xmltest.html').read()
 >>> len(allinput)
 29714
 >>> pattern=re.compile('(</?PubMedArticle>)',re.DOTALL)
 >>> allsplit = pattern.split(allinput)

In the following, allsplit[i] is the (.*?) text you wanted, I think, but it's a bit long, so
I just printed the first and last 80 chars and bracketed with <wyw> <...> </wyw>
([w]hat [y]ou [w]ant ;-):

 >>> for i in range(2,len(allsplit),4): print '<wyw>%s\n<...>\n%s</wyw>\n' % (
 ...                    allsplit[i][:80],allsplit[i][-80:])
 ...
 <wyw>
 <MedlineCitation Status="Completed">
 <MedlineID>99071918&
 <...>
 quot;>99071918</ArticleId>
         </ArticleIdList>
 </PubmedData>
 </wyw>

 <wyw>
 <MedlineCitation Status="Completed">
 <MedlineID>99071917&
 <...>
 quot;>99071917</ArticleId>
         </ArticleIdList>
 </PubmedData>
 </wyw>

 <wyw>
 <MedlineCitation Status="Completed">
 <MedlineID>99071916&
 <...>
 quot;>99071916</ArticleId>
         </ArticleIdList>
 </PubmedData>
 </wyw>

Of course, this depends on there being no missing tags for <PubMedArticle> .. </PubMedArticle>
and no alternative forms of those tags.
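
If you want to guard against that, one option is to check that the tags in the split result strictly alternate open/close before trusting the slices. This is only a sketch under that assumption; the helper name `article_bodies` is made up here, and it does nothing about alternative tag forms (attributes, different case, and so on):

```python
import re

OPEN = "<PubMedArticle>"
CLOSE = "</PubMedArticle>"


def article_bodies(text):
    """Return the text between each <PubMedArticle> .. </PubMedArticle> pair.

    Raises ValueError if the tags do not strictly alternate open/close.
    """
    parts = re.split(r"(</?PubMedArticle>)", text)
    tags = parts[1::2]  # every odd index holds a captured tag
    expected = [OPEN, CLOSE] * (len(tags) // 2)
    if len(tags) % 2 or tags != expected:
        raise ValueError("missing or nested PubMedArticle tags")
    return parts[2::4]  # bodies sit between each open/close pair


if __name__ == "__main__":
    good = "x<PubMedArticle>A</PubMedArticle>y<PubMedArticle>B</PubMedArticle>"
    print(article_bodies(good))
    try:
        article_bodies("<PubMedArticle>A")  # closing tag missing
    except ValueError as err:
        print("rejected:", err)
```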

Regards,
Bengt Richter