Bottleneck? More efficient regular expression?

Tue Sep 23 11:40:18 EDT 2003

Hello,

I've been struggling with a regular expression for parsing XML files, which keeps giving the runtime error "maximum
recursion limit exceeded". Here is the pattern string (a single line in my code):

r'<code>(?P<c>.*?)</code>.*?<targetSeq name="(?P<tn>.*?)">.*?<target>(?P<t>.*?)</target>.*?<align>(?P<a>.*?)</align>.*?<template>(?P<temp>.*?)</template>.*?<anotherTag>(?P<at>.*?)</anotherTag>.*?<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>'

The file format is straightforward. Here is a sample:

<code>1cg2</code>
<chain>a</chain>
<settings>abcde</settings>
<scoreInfo>12345</scoreInfo>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| :    .  |  .                         |.  .  :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>

# this group of tags then repeats multiple times in the file
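
For reference, here is a minimal sketch of how a pattern like this might be applied to the sample. The re.DOTALL flag, the use of finditer, and the names PATTERN, SAMPLE and parse_records are assumptions made for illustration (the dot has to cross newlines somehow for the pattern to match across tags); the real code may differ.

```python
import re

# Assumption: re.DOTALL lets "." match the newlines between tags.
# The pattern is the one from the post, split over several string
# literals only for readability; Python joins them into one string.
PATTERN = re.compile(
    r'<code>(?P<c>.*?)</code>'
    r'.*?<targetSeq name="(?P<tn>.*?)">'
    r'.*?<target>(?P<t>.*?)</target>'
    r'.*?<align>(?P<a>.*?)</align>'
    r'.*?<template>(?P<temp>.*?)</template>'
    r'.*?<anotherTag>(?P<at>.*?)</anotherTag>'
    r'.*?<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>',
    re.DOTALL,
)

# The sample record from the post, copied as-is.
SAMPLE = """<code>1cg2</code>
<chain>a</chain>
<settings>abcde</settings>
<scoreInfo>12345</scoreInfo>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| :    .  |  .                         |.  .  :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>
"""


def parse_records(text):
    """Return one dict of named groups per repeated group of tags."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


# Simulate a file in which the group of tags repeats twice.
records = parse_records(SAMPLE * 2)
for record in records:
    print(record["c"], record["tn"], record["t"])
```

Each match covers one complete group of tags, so the number of dicts returned should equal the number of records in the file.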

If I search for the pattern only up to "</template>" (i.e. without <anotherTag> onwards), it works fine. As soon as I
add the later tags to the pattern, it gives the error.

I heard that non-greedy matching (*?) is inefficient, so I tried replacing every .*? with a negative lookahead such as
(?!<target>), which means "if the next piece of text doesn't match the <target> tag, keep going". But it gives the
same error.
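
A minimal sketch of what such a lookahead replacement might look like, shortened to two tags. This is an assumed form, not necessarily the exact pattern that was tried: a bare (?!<target>) consumes no text, so it is wrapped with a dot and repeated here. The helper name skip_until and the use of re.DOTALL are hypothetical.

```python
import re


def skip_until(tag):
    """Build a fragment that consumes characters until `tag` is next.

    (?!...) itself matches no text; pairing it with "." and repeating
    the pair advances one character at a time while the tag is absent.
    """
    return r"(?:(?!%s).)*" % re.escape(tag)


# Shortened to <code> and <target> to keep the sketch small.
LOOKAHEAD_PATTERN = re.compile(
    r"<code>(?P<c>" + skip_until("</code>") + r")</code>"
    + skip_until("<target>")
    + r"<target>(?P<t>" + skip_until("</target>") + r")</target>",
    re.DOTALL,  # assumption: "." must be able to cross newlines
)

text = (
    "<code>1cg2</code>\n"
    "<chain>a</chain>\n"
    "<target>WLTFQKKHITNTRDVDCDNIMS</target>\n"
)

match = LOOKAHEAD_PATTERN.search(text)
if match:
    print(match.group("c"), match.group("t"))
```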

So my question is: what is the bottleneck in this pattern? Could someone more experienced in REs give some hints here?

Your help is greatly appreciated!

Tina
