[XML-SIG] SAX prettyprinter V2 and SGMLOP

Walter Underwood wunder@infoseek.com
Mon, 25 Jan 1999 10:17:09 -0800


At 04:44 PM 1/23/99 +0100, Christian Tismer wrote:
>What I need to find is the fastest acceptable parser which allows
>me to turn masses of XML data into Python structures. [...] we are 
>processing XML encoded database records which are quite irregular 
>(useless to use a relational database) and quite simple, but the 
>standard size is some 50MB. This is why I'm after speed, much more than
>conformance.

I'm using pyexpat for the XML support in our search engine.
At this point in development, I'm collecting text and associating
it with *every* enclosing element. So this is worst-case for
parsing time.
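
A minimal sketch of that kind of collection, not the search engine's actual code. It assumes a current Python, where pyexpat is exposed as xml.parsers.expat; the function name collect_text_by_element and the sample document are hypothetical. Each chunk of character data is appended to every element still open, which is what makes this the expensive case.

```python
import xml.parsers.expat
from collections import defaultdict


def collect_text_by_element(xml_bytes):
    """Map each element name to a list of text strings, one per occurrence.

    The text of an occurrence includes the text of all its descendants.
    (Hypothetical helper, for illustration only.)
    """
    open_elements = []              # stack of (name, list of text chunks)
    collected = defaultdict(list)

    def start(name, attrs):
        open_elements.append((name, []))

    def char_data(data):
        # Worst case: every chunk goes to every enclosing element.
        for _, chunks in open_elements:
            chunks.append(data)

    def end(name):
        _, chunks = open_elements.pop()
        collected[name].append("".join(chunks))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)   # True marks this as the final chunk
    return dict(collected)


if __name__ == "__main__":
    doc = b"<book><chapter><verse>one</verse><verse>two</verse></chapter></book>"
    print(collect_text_by_element(doc))
```

Appending only to the innermost open element (open_elements[-1]) would be the "collecting less information" variant.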

Parsing Jon Bosak's tagged "Old Testament" (3.4 megabytes) takes
30 seconds. That document is pretty heavily tagged, with an element
for each verse, each chapter, each book, the body, etc.

Collecting less information would probably be faster.

If you need a lot more speed than this (integer factors faster) 
you might need to do all the parsing in C. Remember that there
is a difference between a parser that implements all of XML and
a parser that extracts the data you need from your XML documents.
If you can trust the documents to be legal (perhaps they are 
checked when generated), then a hard-coded parser may be the
answer.
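
To illustrate the idea of an extractor rather than a full parser, here is a rough sketch in Python (a real speed-critical version would more likely be in C, as noted above). It assumes a hypothetical flat record layout, <record> elements holding simple <field>value</field> children, and it trusts the input completely: no attributes, entities, CDATA, comments, or nesting are handled, and nothing is checked for well-formedness.

```python
import re

# Assumed layout (hypothetical): <record><name>value</name>...</record>
RECORD_RE = re.compile(r"<record>(.*?)</record>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def extract_records(text):
    """Return one dict per <record>, mapping field name to raw text."""
    records = []
    for match in RECORD_RE.finditer(text):
        fields = FIELD_RE.findall(match.group(1))   # list of (name, value)
        records.append({name: value for name, value in fields})
    return records


if __name__ == "__main__":
    sample = "<record><id>1</id><name>Ada</name></record>\n<record><id>2</id></record>"
    print(extract_records(sample))
```

The trade-off is the one described above: this is only safe when the documents are known to follow the expected shape, for example because they are checked when generated.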

wunder


Walter R. Underwood
wunder@infoseek.com
wunder@best.com (home)
http://www.best.com/~wunder/
1-408-543-6946