[XML-SIG] SAX drivers comparison (PyXML 0.54).
Stefane Fermigier
sf@fermigier.com
Mon, 15 May 2000 12:08:48 +0200
Hi,
I wrote the following script to test SAX drivers' speed and compatibility.
The results, when run on http://www.dmoz.org/rdf/content.example.txt
are:
Parser: xml.sax.drivers.drv_sgmlop, time: 0.001875, 0 bytes written.
Parser: xml.sax.drivers.drv_pyexpat, time: 0.109332, 4533 bytes written.
!!! xml.sax.drivers.drv_xmltok Error No parsers found
Parser: xml.sax.drivers.drv_xmlproc, time: 0.611368, 4996 bytes written.
!!! xml.sax.drivers.drv_xmltoolkit Error No parsers found
Parser: xml.sax.drivers.drv_xmllib, time: 1.250232, 28223 bytes written.
!!! xml.sax.drivers.drv_xmldc Error No parsers found
What I find most annoying is that no two of the 3 drivers
that work (and I had to change
/usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py line 76
so that sgmlop works; maybe that was a mistake?) give the
same result (the number of bytes written by a trivial document handler).
Here's the script:
###############################################################################
import time, traceback, sys, StringIO
import xml.sax.saxexts, xml.sax.saxlib

# Drivers to compare; the ones that are not installed are reported as errors.
parser_names = ["xml.sax.drivers.drv_sgmlop", "xml.sax.drivers.drv_pyexpat",
                "xml.sax.drivers.drv_xmltok", "xml.sax.drivers.drv_xmlproc",
                "xml.sax.drivers.drv_xmltoolkit", "xml.sax.drivers.drv_xmllib",
                "xml.sax.drivers.drv_xmldc"]

class ContentHandler(xml.sax.saxlib.DocumentHandler):
    # Trivial handler: writes each element name on its own line.
    def __init__(self, buff):
        self.buff = buff
    def startElement(self, name, attrs):
        self.buff.write(name + '\n')

for parser_name in parser_names:
    try:
        parser = xml.sax.saxexts.make_parser(parser_name)
        buff = StringIO.StringIO()
        parser.setDocumentHandler(ContentHandler(buff))
        source = open(sys.argv[1])
        # Time only the parse itself.
        start = time.time()
        parser.parseFile(source)
        elapsed = time.time() - start
        source.close()
        print "Parser: %s, time: %f, %d bytes written." % (
            parser_name, elapsed, len(buff.getvalue()))
    except:
        # Uncomment to see the full traceback for a failing driver.
        #traceback.print_exc()
        print '!!!', parser_name, 'Error', sys.exc_info()[1]
###############################################################################
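One way to investigate why the byte counts differ would be to keep each driver's output and look for the first line where two of them diverge. The following is only a minimal sketch, written in modern Python 3 rather than the Python 1.5 of the script above. The function name `first_divergence` and the two sample strings are hypothetical; they are not real output from any of the drivers.

```python
from itertools import zip_longest


def first_divergence(a, b):
    """Return (line_number, line_a, line_b) for the first differing line.

    Returns None when both outputs are identical. Line numbers start at 1.
    If one output is shorter, its missing line is reported as None.
    """
    lines_a = a.splitlines()
    lines_b = b.splitlines()
    for number, (line_a, line_b) in enumerate(zip_longest(lines_a, lines_b), 1):
        if line_a != line_b:
            return number, line_a, line_b
    return None


# Made-up captured outputs (one element name per line, as the trivial
# handler writes them). These are illustrative, not real driver output.
out_one = "RDF\nTopic\nTitle\n"
out_two = "RDF\ntopic\ntitle\n"

print(first_divergence(out_one, out_two))
```

Applied to the buffers collected for two drivers, this would point at the first element name on which they disagree, which is usually a better starting point than the byte totals alone.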
Regards,
(FYI, my current goal is to parse something like
http://www.dmoz.org/rdf/content.rdf.u8.gz, a 500+ MB XML file, as fast as possible.)
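For a file of that size, the compressed stream can be fed to a SAX parser chunk by chunk so that neither the decompressed text nor a document tree is held in memory. This is a minimal sketch only, assuming the modern Python 3 standard library (`gzip` and `xml.sax` with its default expat-based reader), not the PyXML 0.54 drivers compared above. `ElementCounter`, `count_elements`, and the sample document are hypothetical names and data chosen for illustration.

```python
import gzip
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags; keeps nothing else from the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(gz_stream, chunk_size=64 * 1024):
    """Decompress gz_stream incrementally and feed it to the parser."""
    parser = xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    with gzip.open(gz_stream, "rb") as xml_bytes:
        while True:
            chunk = xml_bytes.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
    # Signal end of input so the parser can finish the document.
    parser.close()
    return handler.count


# Tiny in-memory stand-in for the real dump (made-up content).
sample = gzip.compress(b"<RDF><Topic><Title>x</Title></Topic></RDF>")
print(count_elements(io.BytesIO(sample)))
```

The sketch makes no claim about speed; it only shows that the handler interface used in the script above fits a streaming read of a gzipped file.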
S.
--
Stéfane Fermigier, Tel: 06 63 04 12 77 (mobile).
<www.portalux.com>: the Linux / free software portal.
"Amazon: we patent the dot in .com"