Bottleneck? More efficient regular expression?

Tue Sep 23 21:38:28 EDT 2003

Tina Li:
> I've been struggling with a regular expression for parsing XML files,
> which keeps giving the run time error "maximum recursion limit
> exceeded".
   [ ... in reply to David Eppstein ...]
> I'd like to know what caused theover-limit as well.

When matching a pattern against a string, an ambiguity may arise.
For example the pattern "AX|AY" when matched against text
which starts with "A" can match the "A" in "AX" or the "A" in
"AY".  There are different ways to resolve this ambiguity.

The standard way is to use push the current state on the stack
and try the first possible path.  Every ambiguity gets its own
push on the stack.  If that path fails, the system backtracks by
popping the state off the stack and trying the next possibility.

Every ambiguity point causes a stack push.  (Well, not quite.
It's possible to collapse some of the ambiguity in some cases.
For example, Python's sre regular expression system recognizes
the common prefix and changes the above pattern into
"A|[XY]".)

In your pattern you used .*? frequently.  The thing about . is that
it can match the "<" character, so every time you have a .*?<
you have an ambiguity.  Since you use non-greedy .*?, it takes
the < path first, and pushes the generic capture on the stack.

What I can't figure out though is why you have any stack
problems.  All your ambiguities are easy to figure out, so you
have at most one stack term for each .*?.

I then tried to reproduce it, and wasn't able to do so.  Here's
my code.  (I needed to use re.DOTALL because you want
the .*? to capture the newline as well.)

>>> import re
>>> s="""<code>1cg2</code>
... <chain>a</chain>
... <settings>abcde</settings>
... <scoreInfo>12345</scoreInfo>
... <targetSeq name="1onc">blah
... </targetSeq>
... <alignment size="335">
... <target>WLTFQKKHITNTRDVDCDNIMS</target>
... <align> :| ..| :    .  |  .                         |.  .  :</align>
... <template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
... <anotherTag>foobarfoobar</anotherTag>
... <yetAnotherTag>barfoobarfoo</yetAnotherTag>
... """
>>> pattern = (r'<code>(?P<c>.*?)</code>.*?'
...     r'<targetSeq name="(?P<tn>.*?)">.*?'
...     r'<target>(?P<t>.*?)</target>.*?'
...     r'<align>(?P<a>.*?)</align>.*?'
...     r'<template>(?P<temp>.*?)</template>.*?'
...     r'<anotherTag>(?P<at>.*?)</anotherTag>.*?'
...     r'<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>')
>>> pat = re.compile(pattern, re.DOTALL)
>>> m = pat.search(s)
>>> for k, v in m.groupdict().items():
...  print k, "=", repr(v)
...
a = ' :| ..| :    .  |  .                         |.  .  :'
c = '1cg2'
temp = 'QKRDNVLFQAATDEQPAVIKTLEKL'
yat = 'barfoobarfoo'
tn = '1onc'
t = 'WLTFQKKHITNTRDVDCDNIMS'
at = 'foobarfoobar'
>>>

In any case, as others pointed out, if it's XML you should parse
it as XML.  You don't actually have XML (it needs to be
single rooted and the <alignment size="335"> has no close tag)
so assuming it was fixed, here's some code which uses the
standard XML parsing libraries in Python

>>> s="""<record>
... <code>1cg2</code>
... <chain>a</chain>
... <settings>abcde</settings>
... <scoreInfo>12345</scoreInfo>
... <targetSeq name="1onc">blah
... </targetSeq>
... <alignment size="335"/>
... <target>WLTFQKKHITNTRDVDCDNIMS</target>
... <align> :| ..| :    .  |  .                         |.  .  :</align>
... <template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
... <anotherTag>foobarfoobar</anotherTag>
... <yetAnotherTag>barfoobarfoo</yetAnotherTag>
... </record>
... """
>>> from xml.sax import handler
>>> import xml.sax
>>> class GetFields(handler.ContentHandler):
...    def startDocument(self):
...       self.fields = {}
...       self.save_text = 0
...       self.text = None
...    def startElement(self, name, attrs):
...       if name == "record":
...          return
...       if name == "targetSeq":
...          self.fields["targetSeqName"] = attrs["name"]
...       elif name == "alignment":
...          self.fields["alignmentSize"] = attrs["size"]
...       self.save_text = 1
...       self.text = ""
...    def characters(self, text):
...       if self.save_text:
...          self.text += text
...    def endElement(self, name):
...       if name == "record":
...          return
...       self.fields[name] = self.text
...       self.text = None
...       self.save_text = 0
...
>>> gf = GetFields()
>>> xml.sax.parseString(s, gf)
>>> gf
>>> for k, v in gf.fields.items():
...  print k, "=", repr(v)
...
code = u'1cg2'
yetAnotherTag = u'barfoobarfoo'
target = u'WLTFQKKHITNTRDVDCDNIMS'
chain = u'a'
scoreInfo = u'12345'
align = u' :| ..| :    .  |  .                         |.  .  :'
settings = u'abcde'
anotherTag = u'foobarfoobar'
template = u'QKRDNVLFQAATDEQPAVIKTLEKL'
alignmentSize = u'335'
targetSeq = u'blah\n'
targetSeqName = u'1onc'
alignment = ''
>>>

Another approach is to turn the XML into a DOM but I
have much less experience with that.  (I get confused with
the DOM API every time I try, and I've never needed
to get that far with it.)

> The file format is straighforward. Here is a sample:
>
> <code>1cg2</code>
> <chain>a</chain>

Hmmm...  Looks like you're doing sequence alignments against
PDB structures.  Do you know about biopython.org?

> So my question is: what is the bottleneck in this pattern? Could someone
> more experienced in REs give some hints here?

You're going to have to post your reproducible for me to
help any further than this.

                    Andrew
                    dalke at dalkescientific.com