[XML-SIG] SAX drivers comparison (PyXML 0.54).

Stefane Fermigier sf@fermigier.com
Mon, 15 May 2000 12:08:48 +0200


Hi,

I wrote the following script to test SAX drivers' speed and compatibility.

The results, when run on http://www.dmoz.org/rdf/content.example.txt
are:

Parser: xml.sax.drivers.drv_sgmlop, time: 0.001875, 0 bytes written.
Parser: xml.sax.drivers.drv_pyexpat, time: 0.109332, 4533 bytes written.
!!! xml.sax.drivers.drv_xmltok Error No parsers found
Parser: xml.sax.drivers.drv_xmlproc, time: 0.611368, 4996 bytes written.
!!! xml.sax.drivers.drv_xmltoolkit Error No parsers found
Parser: xml.sax.drivers.drv_xmllib, time: 1.250232, 28223 bytes written.
!!! xml.sax.drivers.drv_xmldc Error No parsers found

What I find most annoying is that no two of the 3 drivers
that work (and I had to change
/usr/lib/python1.5/site-packages/xml/sax/drivers/drv_sgmlop.py line 76
so that sgmlop works; maybe that was a mistake?) give the
same result (number of bytes written by a trivial document handler).

Here's the script:

###############################################################################

import time, traceback, sys, StringIO
import xml.sax.saxexts, xml.sax.saxlib

parser_names = ["xml.sax.drivers.drv_sgmlop", "xml.sax.drivers.drv_pyexpat",
        "xml.sax.drivers.drv_xmltok", "xml.sax.drivers.drv_xmlproc",
        "xml.sax.drivers.drv_xmltoolkit", "xml.sax.drivers.drv_xmllib",
        "xml.sax.drivers.drv_xmldc"]

class ContentHandler(xml.sax.saxlib.DocumentHandler):
        def __init__(self, buff):
                self.buff = buff

        def startElement(self, name, attrs):
                self.buff.write(name + '\n')


for parser_name in parser_names:
        try:
                parser = xml.sax.saxexts.make_parser(parser_name)
                buff = StringIO.StringIO()
                parser.setDocumentHandler(ContentHandler(buff))

                # Time only the parse itself, not the buffer read-back.
                start = time.time()
                parser.parseFile(open(sys.argv[1]))
                elapsed = time.time() - start

                print "Parser: %s, time: %f, %d bytes written." % (
                        parser_name, elapsed, len(buff.getvalue()))
        except:
                # Uncomment the next line for the full traceback.
                #traceback.print_exc()
                print '!!!', parser_name, 'Error', sys.exc_info()[1]

###############################################################################
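
One way to narrow down why the byte counts differ would be to keep the
element names each driver reports and look for the first place where two
drivers disagree. The following is only a sketch under assumptions: it is
written for a current Python rather than Python 1.5, the helper name
first_divergence is hypothetical, and the two sample lists are made up for
illustration (they are not output from any of the drivers above).

```python
def first_divergence(names_a, names_b):
    """Return (index, item_a, item_b) for the first position where the two
    sequences differ, or None if they are identical. A missing item (one
    sequence is shorter) is reported as None."""
    for i, (a, b) in enumerate(zip(names_a, names_b)):
        if a != b:
            return i, a, b
    if len(names_a) != len(names_b):
        i = min(len(names_a), len(names_b))
        a = names_a[i] if i < len(names_a) else None
        b = names_b[i] if i < len(names_b) else None
        return i, a, b
    return None


# Made-up sample data: what two handlers might have written, one name per line.
out_a = "RDF\nTopic\nlink\n".splitlines()
out_b = "RDF\nTopic\nExternalPage\nlink\n".splitlines()

if __name__ == "__main__":
    print(first_divergence(out_a, out_b))
```

With the real script, the two lists would come from splitting each driver's
buffer contents on newlines.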

Regards,

(FYI my current goal is to parse as fast as possible something like
http://www.dmoz.org/rdf/content.rdf.u8.gz which is a 500+ MB XML file).
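
For a file of that size, the parse would presumably have to be streamed
straight from the compressed file, with a handler that keeps no document
state. A minimal sketch of that idea, under assumptions: it uses the xml.sax
and gzip modules of a current Python standard library, not the PyXML 0.54
drivers tested above, and the names ElementCounter, count_elements and
count_elements_gz are hypothetical.

```python
import gzip
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags; nothing from the document is kept in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(stream):
    """Parse XML from a binary file-like object and return the element count."""
    handler = ElementCounter()
    xml.sax.parse(stream, handler)  # reads the stream in chunks
    return handler.count


def count_elements_gz(path):
    """Same as count_elements, decompressing a .gz file on the fly."""
    with gzip.open(path, "rb") as stream:
        return count_elements(stream)
```

It would be called as count_elements_gz("content.rdf.u8.gz") on a local copy
of the file. Whether this is fast enough for 500+ MB is exactly what would
have to be measured.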

	S.

-- 
Stéfane Fermigier, Tel: 06 63 04 12 77 (mobile).
<www.portalux.com>: le portail Linux / logiciel libre.
"Amazon: we patent the dot in .com"