[XML-SIG] Performance question
Henry S. Thompson
ht@cogsci.ed.ac.uk
06 Nov 2002 13:59:16 +0000
"Fred L. Drake, Jr." <fdrake@acm.org> writes:
> Henry S. Thompson writes:
> > If you want _another_ factor of 10, go to PyLTXML. The report below
> > is from Python 2.2.1 on RedHat Linux 7.2 using PyXML 0.8.1 and
> > PyLTXML-1.3-2.
>
> Wow! That's fast!
>
> > I used Fred's driver, added two new functions to test bit-level and
> > tree-level access via PyLTXML.
> >
> > parser performance test
> > 100 parses took 3.88 seconds, or 0.04 seconds/parse
> > 100 parses took 0.25 seconds, or 0.00 seconds/parse
> > 100 parses took 0.02 seconds, or 0.00 seconds/parse
> > 100 parses took 0.03 seconds, or 0.00 seconds/parse
> >
> > The first measurement is the original 4DOM DOM builder, the second is
> > the expatbuilder, the third is PyLTXML returning the whole tree, the
> > fourth is PyLTXML returning every bit (start tag, end tag, text). I
> > guess the tree is faster because it's slightly lazy wrt Python
> > structures, i.e. only the root is in Python form as returned, the rest
> > gets converted from the native C structs as you walk the Python tree.
>
> So is the resulting object compliant (or at least close) to the Python
> DOM, as defined in the Python Library Reference?
>
> http://www.python.org/doc/current/lib/module-xml.dom.html
Close.
> (Lazy building of structures is fine, of course, since that's
> implementation.) If it doesn't support the DOM API, does it support
> something with an equivalent model and functionality?
I believe so -- our model actually _predates_ the DOM, and we've never
had the time/resources to roll it forward, but it was of course
solving the same problem.
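
The lazy conversion described in the quoted text (only the root is in Python form when the tree is returned, and the rest is converted as you walk it) can be pictured with a minimal sketch. This is not PyLTXML's code: the LazyNode class, the tuple-based stand-in for the native C structs, and the conversion counter are all hypothetical, invented here for illustration only.

```python
# Hypothetical illustration of lazy conversion; NOT PyLTXML's actual code.
class LazyNode:
    """Wraps a 'native' record and converts children only on first access."""

    conversions = 0  # how many nodes have been turned into Python objects

    def __init__(self, native):
        # `native` stands in for a C struct: (label, [child records])
        self._native = native
        self._children = None
        LazyNode.conversions += 1

    @property
    def label(self):
        return self._native[0]

    @property
    def children(self):
        # Convert the immediate children on first access, then cache them.
        if self._children is None:
            self._children = tuple(LazyNode(c) for c in self._native[1])
        return self._children


native_tree = ("doc", [("a", [("b", [])]), ("c", [])])
root = LazyNode(native_tree)
after_parse = LazyNode.conversions              # only the root so far
first_level = [n.label for n in root.children]  # forces one level of conversion
after_walk = LazyNode.conversions               # root plus its two children
```

In this sketch a caller that only needs the root pays for one conversion, which is one plausible reason a whole-tree call could time no slower than a bit-by-bit walk.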
The documentation lists the following Python object types:
FileType
DoctypeType
ElementTypeType
ContentParticleType
AttrDefnType
BitType
ItemType
OOBType
ERefType
QueryType
These correspond to the xml.dom objects as follows, I think:

PyLTXML        xml.dom (Library Reference section)
FileType       13.6.2.1   DOMImplementation Objects
ItemType       13.6.2.2   Node Objects
python tuple   13.6.2.3   NodeList Objects
DoctypeType    13.6.2.4   DocumentType Objects
FileType       13.6.2.5   Document Objects
ItemType       13.6.2.6   Element Objects
not exposed    13.6.2.7   Attr Objects
not exposed    13.6.2.8   NamedNodeMap Objects
OOBType        13.6.2.9   Comment Objects
ItemType       13.6.2.10  Text and CDATASection Objects
OOBType        13.6.2.11  ProcessingInstruction Objects
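
For code that has to bridge the two models, the correspondence above can be written down as a plain lookup table. This is only a transcription of the tentative mapping in this message; the dictionary name and the helper function are hypothetical, and nothing here calls PyLTXML itself.

```python
# Transcribed from the correspondence listed above; None means "not exposed".
# The names XML_DOM_TO_PYLTXML and unexposed_dom_objects are hypothetical.
XML_DOM_TO_PYLTXML = {
    "DOMImplementation": "FileType",
    "Node": "ItemType",
    "NodeList": "python tuple",
    "DocumentType": "DoctypeType",
    "Document": "FileType",
    "Element": "ItemType",
    "Attr": None,
    "NamedNodeMap": None,
    "Comment": "OOBType",
    "Text and CDATASection": "ItemType",
    "ProcessingInstruction": "OOBType",
}


def unexposed_dom_objects(mapping=XML_DOM_TO_PYLTXML):
    """Return the xml.dom object kinds with no exposed counterpart."""
    return sorted(name for name, ltxml in mapping.items() if ltxml is None)


missing = unexposed_dom_objects()
```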
The details are in the documentation that comes with the source
distribution. The distribution uses distutils and is available,
behind a GPL click-wrap, at
http://www.ltg.ed.ac.uk/software/xml/
To avoid hassle, you'll want the source and the appropriate binary
distribution at a minimum -- actually _building_ the extension
requires an LT XML installation as well.
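
For anyone wanting to repeat a measurement of the kind quoted above, here is a minimal timing sketch. It is not Fred's driver, which is not reproduced in this message. It assumes a current Python 3 (time.perf_counter did not exist in Python 2.2) and times only the standard library's xml.dom.minidom; the sample document and the function names are hypothetical. Another parser can be timed by passing its parse function in place of minidom.parseString.

```python
import time
from xml.dom import minidom

# Hypothetical sample input; substitute a real document for meaningful numbers.
SAMPLE = "<doc><item id='1'>text</item><item id='2'/></doc>"


def time_parses(parse, data, count=100):
    """Call parse(data) `count` times and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(count):
        parse(data)
    return time.perf_counter() - start


def report(elapsed, count=100):
    """Format a result in the same layout as the quoted output lines."""
    return "%d parses took %.2f seconds, or %.2f seconds/parse" % (
        count, elapsed, elapsed / count)


elapsed = time_parses(minidom.parseString, SAMPLE)
print(report(elapsed))
```

With only two decimal places, any per-parse time under 5 ms prints as 0.00, as in the quoted output; a wider format such as %.4f would separate the faster parsers.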
ht
--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2002, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]