python xml DOM? pulldom? SAX?

Alan Kennedy alanmk at
Mon Aug 29 19:32:13 CEST 2005

 > I want to get text out of some nodes of a huge xml file (1.5 GB). The
 > architecture of the xml file is something like this


 > I want to combine the text out of page:title and page:revision:text
 > for every single page element. One by one I want to index these
 > combined texts (so for each page one index)
 > What is the most efficient API for that?:
 > SAX (I don't think so)

SAX is perfect for the job. See code below.

 > DOM

If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory 
to load it into a DOM.
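For a small document the DOM route is only a couple of lines, which is what
makes it tempting; the catch is that the entire tree has to fit in memory at
once. A minimal sketch with minidom (the element names here are just made up
for illustration):

```python
from xml.dom import minidom

# parseString builds the *entire* document tree in memory before returning.
# Fine for a toy document; a non-starter for 1.5 GB of xml.
doc = minidom.parseString(
    "<parent><page><title>Page one</title></page></parent>")
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]
```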

 > or pulldom?

Not sure how pulldom does its pull "optimizations", but I think it 
still builds an in-memory object structure for your document, which will 
still take buckets of memory for such a big document. I could be wrong 
about that, though.
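For what it's worth, pulldom does stream its events, and only builds a DOM
subtree when you explicitly ask for one with expandNode, so memory use depends
on how you drive it. A minimal sketch (element names again made up):

```python
from xml.dom import pulldom

# Events are delivered one at a time; no full tree is built up front.
events = pulldom.parseString(
    "<parent><page><title>Page one</title></page></parent>")
titles = []
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "page":
        # expandNode builds a DOM subtree for just this <page> element,
        # so only one page's worth of tree is in memory at a time.
        events.expandNode(node)
        titles.append(node.getElementsByTagName("title")[0].firstChild.data)
```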

 > Or should I just use Xpath somehow.

Using xpath normally requires building a (D)OM, which will consume 
*lots* of memory for your document, regardless of how efficient the OM is.

Best to use SAX and XPath-style expressions.

You can get a limited subset of xpath using a SAX handler and a stack. 
Your problem is particularly well suited to that kind of solution. Code 
that does a basic job of this for your specific problem is given below.

Note that there are a number of caveats with this code:

1. Character data handlers may get called multiple times for a single 
xml text() node. This is permitted by the SAX spec, and is basically a 
consequence of using buffered IO to read the contents of the xml file, 
e.g. the start of a text node is at the end of one buffer read, and the 
rest of the text node is at the beginning of the next buffer.

2. This code assumes that your "revision/text" nodes do not contain 
mixed content, i.e. a mixture of elements and text, e.g. 
"<revision><text>This is a piece of <b>revision</b> 
text</text></revision>". The code below will fail to extract all 
character data in that case.
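If you do need to cope with both caveats at once, one approach (a sketch, not 
part of the code below) is to buffer every characters() call between the start 
and end of the element you care about, which survives both split-up callbacks 
and nested markup:

```python
import xml.sax

class TextCollector(xml.sax.handler.ContentHandler):
    """Collects all character data inside <text>, surviving both
    split-up characters() calls and mixed content like <b>...</b>."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.in_text = 0
        self.parts = []
        self.results = []

    def startElement(self, name, attrs):
        if name == 'text':
            self.in_text = 1
            self.parts = []

    def endElement(self, name):
        if name == 'text':
            self.in_text = 0
            self.results.append("".join(self.parts))

    def characters(self, data):
        # Called once per text chunk, including chunks inside <b>...</b>;
        # we simply accumulate them all until </text>.
        if self.in_text:
            self.parts.append(data)

handler = TextCollector()
xml.sax.parseString(b"<r><text>a <b>bold</b> tail</text></r>", handler)
# handler.results now holds the fully reassembled text.
```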

import xml.sax

class Page:

   def append(self, field_name, new_value):
     old_value = ""
     if hasattr(self, field_name):
       old_value = getattr(self, field_name)
     setattr(self, field_name, "%s%s" % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):

   def __init__(self, page_handler=None):
     self.page_handler = page_handler
     self.stack = []

   def check_stack(self):
     stack_expr = "/" + "/".join(self.stack)
     if '/parent/page' == stack_expr:
       self.page = Page()
     elif '/parent/page/title/text()' == stack_expr:
       self.page.append('title', self.chardata)
     elif '/parent/page/revision/id/text()' == stack_expr:
       self.page.append('revision_id', self.chardata)
     elif '/parent/page/revision/text/text()' == stack_expr:
       self.page.append('revision_text', self.chardata)

   def startElement(self, elemname, attrs):
     self.stack.append(elemname)
     self.check_stack()

   def endElement(self, elemname):
     if elemname == 'page' and self.page_handler:
       self.page_handler(self.page)
       self.page = None
     self.stack.pop()

   def characters(self, data):
     self.chardata = data
     self.stack.append('text()')
     self.check_stack()
     self.stack.pop()

testdoc = """
<parent>
  <page>
    <title>Page number 1</title>
    <revision>
      <id>1</id>
      <text>revision one</text>
    </revision>
  </page>
  <page>
    <title>Page number 2</title>
    <revision>
      <id>2</id>
      <text>revision two</text>
    </revision>
  </page>
</parent>
"""

def page_handler(new_page):
   print "New page"
   print "title\t\t%s" % new_page.title
   print "revision_id\t%s" % new_page.revision_id
   print "revision_text\t%s" % new_page.revision_text

if __name__ == "__main__":
   import StringIO
   parser = xml.sax.make_parser()
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)
   parser.setContentHandler(page_matcher(page_handler))
   parser.parse(StringIO.StringIO(testdoc))


alan kennedy
