[XML-SIG] speed question re DOM parsing
Wed, 31 May 2000 20:58:05 -0600
Greg Stein wrote:
> On Wed, 24 May 2000, Bjorn Pettersen wrote:
> > I'm just starting to work with XML, so be gentle <wink>
> > The problem is that I'm reading in a 280K xml file using the sample code
> > from the XML howto:
> >     def getXmlDomDocument(name):
> >         p = saxexts.make_parser()
> >         dh = SaxBuilder()
> >         p.setDocumentHandler(dh)
> >         p.parseFile(open(name))
> >         p.close()
> >         doc = dh.document
> >         xml.dom.utils.strip_whitespace(doc)
> >         return doc
> > it takes about five seconds to read and parse the file...
> > Is there a better way to read the file (or is there updated code that is
> > faster)?
> If you want a DOM for the output, then no... you'll have to deal with the
> speed. If you have simple requirements for the Python representation of
> the XML, then take a look at xml.utils.qp_xml.
Ok, time for an update ;-)
I've been using the qp_xml.Parser class for a couple of days with good
results. With XML files of ~500K, parsing takes less than 2 secs. I
just got a 1.2MB XML file, however, and the parsing time went up to a
little over 50 secs...
After some profiling, I found that most of the time was going into the
else branch in the cdata method. This branch is growing a string
character by character by saying:
elem.first_cdata = elem.first_cdata + data
To test my assumption, I switched elem.first_cdata to a
cStringIO.StringIO object (I was lazy enough not to implement a
__getattr__). With only this change, the parsing time went down to
about 2.5 secs(!).
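For what it's worth, the difference boils down to something like this
(a minimal sketch, not the actual qp_xml code; io.StringIO stands in
for cStringIO on current Pythons):

```python
import io

def build_quadratic(chunks):
    # Mirrors the original cdata branch: each "s = s + c" copies the
    # whole string accumulated so far, so total work grows
    # quadratically with the amount of character data.
    s = ""
    for c in chunks:
        s = s + c
    return s

def build_buffered(chunks):
    # The StringIO trick: append each chunk into a buffer and pay for
    # a single copy at the end when getvalue() is called.
    buf = io.StringIO()
    for c in chunks:
        buf.write(c)
    return buf.getvalue()

chunks = ["x"] * 10000
assert build_quadratic(chunks) == build_buffered(chunks)
```

Same result either way, but the buffered version does linear work,
which is why the 1.2MB file stopped taking 50 seconds.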
Question: does using StringIO (or perhaps array) and __getattr__ sound
like the right thing to do? (and if so, should I polish my changes and
PS: I'm running on a Pentium II/450MHz with 256MB RAM (in case you
thought I was swapping :-)