[XML-SIG] Re: Help parsing XML
Fredrik Lundh
fredrik at pythonware.com
Wed Mar 30 12:25:16 CEST 2005
Walter Underwood wrote:
> I'd just use a SAX interface. When you see id=HL as an attribute,
> close the old record and start a new one. Do the same thing at
> end of file. Done.
>
> Generally, if the structure is fairly fixed and you are extracting
> the data, think about using SAX. If the shape of the structure
> carries a lot of the information, you might need a DOM.
SAX is dead. if you're not using higher-level APIs, you're doing more work than
you have to, and your code is likely to be slower and buggier than it should be.
here's the (c)ElementTree iterparse version:

    try:
        import cElementTree as ET
    except ImportError:
        from elementtree import ElementTree as ET

    def process(record):
        # receives a list of elements for this record
        for elem in record:
            print elem.tag,
            elem.clear() # won't need this any more
        print

    record = []

    for event, elem in ET.iterparse("test.xml"):
        if elem.tag == "seg" and elem.get("id") == "HL":
            process(record)
            record = []
        record.append(elem)

    if record:
        process(record)

(the cElementTree version of iterparse is about 5 times faster than xml.sax
on the parsing part, and putting state in local variables and logic in the loop
body is a lot more efficient than putting state in instance variables and logic
in a bunch of callback methods).
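
for comparison, here's a rough sketch of what the callback-style SAX approach from the quoted message might look like, written for Python 3 and the standard xml.sax module. the `RecordHandler` class, the `sax_records` helper and the sample document are made up for illustration, and the sketch assumes the input is a flat run of <seg id="..."> elements, which the thread doesn't actually show. note how the "current record" state has to live on the handler instance:

```python
import xml.sax

# hypothetical sample input; the real file layout is an assumption
SAMPLE = b"""<doc>
<seg id="HL"/><seg id="NM1"/>
<seg id="HL"/><seg id="DTP"/><seg id="SV1"/>
</doc>"""


class RecordHandler(xml.sax.ContentHandler):
    """Collect <seg> ids into records; a new record starts at each id="HL"."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.records = []   # finished records
        self.current = []   # record being built (state kept on the instance)

    def startElement(self, name, attrs):
        if name != "seg":
            return  # ignore the wrapper element
        if attrs.get("id") == "HL" and self.current:
            # close the old record and start a new one
            self.records.append(self.current)
            self.current = []
        self.current.append(attrs.get("id"))

    def endDocument(self):
        # do the same thing at end of file
        if self.current:
            self.records.append(self.current)


def sax_records(data):
    """Parse XML bytes and return the list of records (lists of seg ids)."""
    handler = RecordHandler()
    xml.sax.parseString(data, handler)
    return handler.records
```

calling sax_records(SAMPLE) on the sample above gives two records, one per HL group. the grouping logic is the same as in the iterparse loop, just spread over two callbacks and two instance attributes.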
here's a "functional" version of the same thing, btw:

    import cElementTree as ET
    from itertools import groupby
    from operator import itemgetter

    def source(file):
        # assign a unique serial to each HL group
        serial = 0
        for event, elem in ET.iterparse(file):
            if elem.tag == "seg" and elem.get("id") == "HL":
                serial += 1
            yield serial, elem

    for dummy, record in groupby(source("test.xml"), itemgetter(0)):
        # process record
        for dummy, elem in record:
            print elem,
            elem.clear()
        print

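
if you want to try the groupby approach on a current interpreter, here's a minimal sketch of the same idea for Python 3, using xml.etree.ElementTree from the standard library in place of cElementTree. the sample document and the `grouped_ids` helper are assumptions for illustration only. unlike the code above, this sketch skips anything that isn't a <seg> element and collects the ids into lists instead of printing them:

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby
from operator import itemgetter

# hypothetical sample input; the real file layout is an assumption
SAMPLE = b"""<doc>
<seg id="HL"/><seg id="NM1"/>
<seg id="HL"/><seg id="DTP"/><seg id="SV1"/>
</doc>"""


def source(file):
    # assign a serial number to each HL group
    serial = 0
    for event, elem in ET.iterparse(file):  # default: "end" events only
        if elem.tag != "seg":
            continue  # skip the wrapper element (not in the original code)
        if elem.get("id") == "HL":
            serial += 1
        yield serial, elem


def grouped_ids(file):
    """Return one list of seg ids per HL group."""
    result = []
    for _, record in groupby(source(file), itemgetter(0)):
        ids = []
        for _, elem in record:
            ids.append(elem.get("id"))  # read the id before clearing
            elem.clear()                # clear() also drops the attributes
        result.append(ids)
    return result
```

grouped_ids(io.BytesIO(SAMPLE)) yields one list per HL group; iterparse accepts either a filename or an open file object, so the same function works on a real file.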
</F>