[XML-SIG] I am confused...

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 29 Jan 2001 13:39:48 -0700


> > I remember I was doing queries in the form
> > "/article/author/name"
> > - and it was so slow... (0.5 - 1 sec per query on Celeron 400)
> 
> What kind of API did you use? For simple queries like this, a SAX
> ContentHandler may be sufficient. Using Uche's bigxml file, you can
> try
> 
> import time
> import xml.sax
> class NameRetriever(xml.sax.ContentHandler):
>     def __init__(self):
>         self.authors = []
>         self.in_author = self.in_name = 0
> 
>     def startElement(self, tag, attrs):
>         if tag=="author":
>             self.in_author = 1
>         else:
>             if self.in_author and tag == "name":
>                 self.in_name = 1
>                 self.txt = ""
> 
>     def characters(self,str):
>         if self.in_name:
>             self.txt = self.txt+str
> 
>     def endElement(self,tag):
>         if self.in_name and tag=="name":
>             self.authors.append(self.txt)
>             self.in_name=0
>         elif self.in_author and tag=="author":
>             self.in_author=0
> 
> h = NameRetriever()
> start=time.time();xml.sax.parse("bigxml",handler=h);end = time.time()
> print end - start
> print len(h.authors)

This one needs to go into the XML HOWTO as an example.  We now have an XPath 
and SAX approach.  It would be easy to add a DOM approach.  I'll try to do it 
with the extra 3 hours the Devil offered me today in exchange for the pinkie 
fingernail of my soul.
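
For the record, a DOM version might look roughly like the following.  This is 
only a minimal sketch, assuming the standard xml.dom.minidom rather than 
cDomlette or 4DOM, and a reasonably recent Python.  The function name 
author_names and the tiny sample document are made up for illustration; they 
are not from bigxml.  Like the SAX handler, it picks up any "name" element 
found inside an "author" element.

```python
import xml.dom.minidom


def author_names(xml_text):
    """Collect the text of every <name> found inside an <author> element."""
    doc = xml.dom.minidom.parseString(xml_text)
    names = []
    for author in doc.getElementsByTagName("author"):
        for name in author.getElementsByTagName("name"):
            # Join only the direct text children of <name>.
            text = "".join([node.data for node in name.childNodes
                            if node.nodeType == node.TEXT_NODE])
            names.append(text)
    doc.unlink()  # release the tree once the strings have been copied out
    return names


# Hypothetical sample document, standing in for the real bigxml file.
sample = ("<article>"
          "<author><name>A. Writer</name></author>"
          "<author><name>B. Coder</name></author>"
          "<name>not an author</name>"
          "</article>")
names = author_names(sample)
```

Note that this builds the whole tree in memory before looking at any of it, 
which is the main trade-off against the SAX handler.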

> To my own surprise, this is not as fast as the cDomlette; probably
> because the latter links directly with expat, and thus avoids a number
> of indirections. Still, it takes only about three times as long (1.4s
> vs 0.5s on my machine), and it will work on any Python 2.0 installation.

Cool!  I must confess that I would have guessed that SAX was close to 
cDomlette.  Yes, PySAX does add quite a bit of overhead (which was one of the 
motivations for the PyExpat reader and cDomlette), but I would have thought 
that integrating the processing with the parsing would make up the 
difference.
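
To illustrate what skipping the SAX layer means, here is a rough sketch of 
the same extraction written against the expat bindings directly 
(xml.parsers.expat).  The function name author_names_expat and the sample 
input are assumptions for illustration, and I make no timing claim for it; 
it only shows the callbacks being handed straight to the parser.

```python
import xml.parsers.expat


def author_names_expat(xml_text):
    """Collect <name> text inside <author> using expat callbacks directly."""
    names = []
    state = {"in_author": False, "in_name": False, "text": []}

    def start(tag, attrs):
        if tag == "author":
            state["in_author"] = True
        elif state["in_author"] and tag == "name":
            state["in_name"] = True
            state["text"] = []

    def chars(data):
        # expat may deliver one text run in several pieces, so accumulate.
        if state["in_name"]:
            state["text"].append(data)

    def end(tag):
        if state["in_name"] and tag == "name":
            names.append("".join(state["text"]))
            state["in_name"] = False
        elif state["in_author"] and tag == "author":
            state["in_author"] = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return names


# Hypothetical sample input, not the real bigxml file.
expat_names = author_names_expat(
    "<article><author><name>A. Writer</name></author>"
    "<name>not an author</name></article>")
```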

Looks as if we might want to consider expanding cDomlette into a full-blown 
mutable DOM, though Mike and I are still discussing the best internal data 
structures.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python