Bottleneck? More efficient regular expression?
adalke at mindspring.com
Wed Sep 24 21:37:23 CEST 2003
> Yes, I've shortened the XML snippet considerably as
> the data was obscurely long ...
But since there wasn't a problem with the code you posted
it's hard to figure out what's wrong. (And since it wasn't
in XML it made it even harder for people to help.)
If it's too long (and a threaded alignment which includes the
full PDB structure is too long), put it on a web site somewhere
and give a URL to it.
> Andrew: Thanks for the XML code. I've written up something
> similar using xml.parsers.expat, but it's conceivably slower
> than regexp.
Conceivable, yes. But 1) did you test it, and 2) would it make a
> And thanks for suggesting biopython.org. But I need to have it up
> quick so maybe cobbling something
> together myself is faster than reading through their documentation.
Not only that, but I don't know of anything in biopython for
handling that code. It's a generic XML parsing question, so a
generic tool is best, like Fredrik's ElementTree.
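For a rough idea of what the generic-parser route might look like, here is a minimal sketch using ElementTree (it ships with current Python as xml.etree.ElementTree). The sample document is made up: the tag names come from the parser code later in this message, the attribute names on <threading> are guesses, and read_threading is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real threading output has more fields and the
# attribute names here are assumptions, not taken from the real format.
SAMPLE = """<threading name="t1" source="seq.fasta" template="1abc">
  <pdbcode>1abc</pdbcode>
  <pdbchain>A</pdbchain>
  <templateName>example template</templateName>
</threading>"""

def read_threading(text):
    """Return the root attributes plus a few child fields as one dict."""
    root = ET.fromstring(text)
    record = dict(root.attrib)
    for tag in ("pdbcode", "pdbchain", "templateName"):
        # findtext returns the text of the first matching child element
        record[tag] = root.findtext(tag)
    return record

record = read_threading(SAMPLE)
print(record["pdbcode"], record["pdbchain"])
```

Elements you don't care about are simply never asked for, so there is no skipping code to write.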
Here's a question for you. When is it easier to read the
documentation for existing code (which has been tested)
than it is to write and debug your own code?
You can write the regex in a way that's easier to read
pattern = re.compile (
You can also make the pattern a bit less ambiguous,
e.g., use [^>]* instead of .*? when you are inside an element,
(and use [^"]* instead of .*? for getting the text of an
You can get rid of the other ambiguity (skipping characters
until the start of the next tag) by using something like
And you can get rid of all ambiguities by having your
pattern match the text completely, that is, capturing all
fields even if you ignore a few. That way you don't have to
write code which skips over tags. That's easily done
with code which generates your pattern, as in
def match_tag(tagname, groupname):
    # the tag name is used twice: once to open, once to close
    return "<%s>(?P<%s>[^<]*)</%s>" % (tagname, groupname, tagname)
pattern = "[^<]*".join([match_tag("pdbcode", "pdbcode"), ...])
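A runnable sketch of the pattern-generating idea, for illustration only. The field list and the sample text are assumptions (the tag names are the ones used by the parser code later in this message, and the group names are arbitrary).

```python
import re

def match_tag(tagname, groupname):
    # [^<]* cannot run past the start of the next tag, unlike .*?
    return "<%s>(?P<%s>[^<]*)</%s>" % (tagname, groupname, tagname)

# (tag name, group name) pairs, in the order they appear in the file.
FIELDS = [
    ("pdbcode", "pdbcode"),
    ("pdbchain", "pdbchain"),
    ("templateName", "template_name"),
]

# Joining with "[^<]*" soaks up the whitespace between elements.
pattern = re.compile("[^<]*".join(match_tag(t, g) for t, g in FIELDS))

# Made-up sample text.
text = ("<pdbcode>1abc</pdbcode>\n"
        "<pdbchain>A</pdbchain>\n"
        "<templateName>demo</templateName>")
m = pattern.search(text)
print(m.groupdict())
```

Because every field is matched explicitly, a missing or reordered tag makes the whole match fail instead of quietly picking up the wrong text.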
You'll need to extend it to support matching fields with
In any case, what you have won't support XML like
(I put extra spaces in the open tag.)
Instead, you are writing parsing code for the specific XML
subset your threading code produces, which may change in
the future. That's why you should use an XML parser.
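To make the whitespace point concrete, here is a small sketch comparing a strict regex with ElementTree. The tag and its content are made-up samples, and the strict pattern is my own stand-in rather than the pattern from the original post.

```python
import re
import xml.etree.ElementTree as ET

# A strict pattern that expects the open tag to be written exactly one way.
strict = re.compile(r"<pdbcode>(?P<pdbcode>[^<]*)</pdbcode>")

plain = "<pdbcode>1abc</pdbcode>"
spaced = "<pdbcode   >1abc</pdbcode>"   # whitespace before ">" is legal XML

print(strict.search(plain) is not None)    # True
print(strict.search(spaced) is not None)   # False: the regex misses it
print(ET.fromstring(spaced).text)          # 1abc: the parser does not care
```

You could patch the regex with \s* in the right places, but each patch only covers the variation you happened to think of.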
And if you want to support only this specific format, it's
still easier to write a traditional line-oriented parser.
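def readcheck(infile, start):
    # Read the next line and make sure it is the one we expect.
    # (The check and return are an assumed completion; the archived
    # copy of this function stops after the readline.)
    line = infile.readline()
    if not line.lstrip().startswith(start):
        raise ValueError("expected %r but found %r" % (start, line))
    return line

def simpletag(line, convert = str):
    # Pull out the text between "<tag>" and "</tag>" on a single line.
    # (The return is likewise an assumed completion.)
    i = line.find(">")
    j = line.find("<", i)
    return convert(line[i+1:j])

infile = open("threading.xml")
line = readcheck(infile, "<threading")
# the three attribute values sit between the double quotes
_, name, _, source, _, template, _ = line.split('"')
line = readcheck(infile, "<pdbcode")
pdbcode = simpletag(line)
pdbchain = simpletag(readcheck(infile, "<pdbchain"))
templateName = simpletag(readcheck(infile, "<templateName"))
# don't save the settings
line = infile.readline()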
This may be clumsier or more tedious, but it's easy to understand
and the ways it fails are much easier to diagnose than with regexps.
That said, I like parsing with regexps. But this isn't the right
place to use them.
dalke at dalkescientific.com