[XML-SIG] speed question re DOM parsing

Walter Underwood wunder@ultraseek.com
Thu, 22 Jun 2000 09:37:56 -0700

I know this is a reply to a really old post, but the talk about
speedup reminded me that I hadn't answered it.

This is worth posting generally, because it is applicable to
lots of SAX handlers (add it to the documentation?).

--On Wednesday, May 31, 2000 8:58 PM -0600 Bjorn Pettersen
<bjorn@roguewave.com> wrote:
> After some profiling, I found that most of the time was going into the
> else branch in the cdata method.  This branch is growing a string
> character by character by saying:
>   elem.first_cdata = elem.first_cdata + data

I had one of those in my character data handler too. Parsing the
Old Testament took about 45 min, as I remember. Each concatenation
copies the whole string built so far, so the copies and reallocs
add up to O(n**2). Save all the strings in a list instead, then
use string.join once at the end. This is linear.

Here are the relevant fragments of the class with the handlers:

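import string

class XMLToText:
    def __init__(self):
        # Collect the character data chunks in a list.
        self.text = []

    def cdata(self, data):
        # Appending to a list is cheap; no string copy here.
        self.text.append(data)

    def finish(self):
        # One join at the end builds the result in linear time.
        self.text = string.join(self.text,u'')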

Note the Unicode string constant -- remove that for Python 1.5,
and add code to handle the UTF-8, if necessary.
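
For a SAX handler, the same list-and-join idea might look like the
sketch below. It assumes the standard xml.sax module and a current
Python, where join is a string method rather than a function in the
string module. The names TextCollector and extract_text are made up
for illustration and are not part of any library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: gathers all character data in a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # one entry per characters() callback
        self.text = None   # filled in by endDocument()

    def characters(self, content):
        # The parser may deliver text in many small pieces; just store them.
        self.chunks.append(content)

    def endDocument(self):
        # A single join at the end instead of repeated concatenation.
        self.text = "".join(self.chunks)


def extract_text(xml_bytes):
    # Hypothetical helper: parse a bytes document and return its text.
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.text
```

Calling extract_text on a small document returns the concatenated
character data, with the element markup dropped.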

Walter R. Underwood
Senior Staff Engineer, Ultraseek Corp.