[XML-SIG] speed question re DOM parsing

Bjorn Pettersen bjorn@roguewave.com
Wed, 31 May 2000 20:58:05 -0600


Greg Stein wrote:
> 
> On Wed, 24 May 2000, Bjorn Pettersen wrote:
> > I'm just starting to work with XML, so be gentle <wink>
> >
> > The problem is that I'm reading in a 280K xml file using the sample code
> > from the XML howto:
> >
> > def getXmlDomDocument(name):
> >         p = saxexts.make_parser()
> >         dh = SaxBuilder()
> >         p.setDocumentHandler(dh)
> >         p.parseFile(open(name))
> >         p.close()
> >         doc = dh.document
> >         xml.dom.utils.strip_whitespace(doc)
> >         return doc
> >
> > it takes about five seconds to read and parse the file...
> >
> > Is there a better way to read the file (or is there updated code that is
> > faster)?
> 
> If you want a DOM for the output, then no... you'll have to deal with the
> speed. If you have simple requirements for the Python representation of
> the XML, then take a look at xml.utils.qp_xml.
> 
> Cheers,
> -g

Ok, time for an update ;-)

I've been using the qp_xml.Parser class for a couple of days with good
results.  With XML files of ~500K, parsing takes less than 2 secs.  I
just got a 1.2MB XML file, however, and the parsing time went up to a
little over 50 secs...

After some profiling, I found that most of the time was going into the
else branch in the cdata method.  This branch is growing a string
character by character by saying:

  elem.first_cdata = elem.first_cdata + data

To test my assumption, I switched elem.first_cdata to be a
cStringIO.StringIO object (I was too lazy to implement a
__getattr__).  With only this change, the parsing time went down to
about 2.5 secs(!).
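
For illustration, here is a rough sketch of the two approaches side by
side.  It is not the actual qp_xml code: the Elem class, the cdata_buf
attribute and the function names are hypothetical, and io.StringIO
stands in for cStringIO.StringIO so the example runs on a current
Python.

```python
import io


class Elem:
    """Hypothetical stand-in for a qp_xml element (illustration only)."""

    def __init__(self):
        self.first_cdata = ''            # plain-string version
        self.cdata_buf = io.StringIO()   # buffered version (assumed name)


def cdata_concat(elem, data):
    # Concatenation approach: each call can copy the whole string
    # built so far, so total work grows with the square of the size.
    elem.first_cdata = elem.first_cdata + data


def cdata_buffered(elem, data):
    # Buffered approach: append to an in-memory file object instead.
    elem.cdata_buf.write(data)


def build(chunks):
    """Feed the same chunks through both versions; return both results."""
    elem = Elem()
    for chunk in chunks:
        cdata_concat(elem, chunk)
        cdata_buffered(elem, chunk)
    return elem.first_cdata, elem.cdata_buf.getvalue()
```

Both versions produce the same text; only the cost of getting there
differs as the number of chunks grows.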

Question:  does using StringIO (or perhaps array) and __getattr__ sound
like the right thing to do? (and if so, should I polish my changes and
submit them?)
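
One possible shape for the __getattr__ variant, purely as a sketch:
LazyElem, add_cdata and _chunks are made-up names, not part of qp_xml,
and a plain list of chunks is used here in place of StringIO or array.

```python
class LazyElem:
    """Hypothetical element that collects cdata chunks in a list."""

    def __init__(self):
        self._chunks = []

    def add_cdata(self, data):
        # Appending to a list is cheap; no string is rebuilt here.
        self._chunks.append(data)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so
        # first_cdata is assembled on demand from the chunks.
        if name == 'first_cdata':
            return ''.join(self._chunks)
        raise AttributeError(name)
```

Callers would keep reading elem.first_cdata as before, and the join
only happens when somebody actually asks for the text.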

-- bjorn

ps: I'm running on a Pentium-II/450MHz with 256MB RAM (in case you
thought I was swapping :-)