[XML-SIG] Re: Help parsing XML

Fredrik Lundh fredrik at pythonware.com
Wed Mar 30 12:25:16 CEST 2005


Walter Underwood wrote:

> I'd just use a SAX interface. When you see id=HL as an attribute,
> close the old record and start a new one. Do the same thing at
> end of file. Done.
>
> Generally, if the structure is fairly fixed and you are extracting
> the data, think about using SAX. If the shape of the structure
> carries a lot of the information, you might need a DOM.

SAX is dead.  if you're not using higher-level APIs, you're doing more work than
you have to, and your code is likely to be slower and buggier than it should be.

here's the (c)ElementTree iterparse version:

    try:
        import cElementTree as ET
    except ImportError:
        from elementtree import ElementTree as ET

    def process(record):
        # receives a list of elements for this record
        for elem in record:
            print elem.tag,
            elem.clear() # won't need this any more
        print

    record = []
    for event, elem in ET.iterparse("test.xml"):
        if elem.tag == "seg" and elem.get("id") == "HL":
            process(record)
            record = []
        record.append(elem)
    if record:
        process(record)

(the cElementTree version of iterparse is about 5 times faster than xml.sax
on the parsing part, and putting state in local variables and logic in the loop
body is a lot more efficient than putting state in instance variables and logic
in a bunch of callback methods).
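
for readers on a current Python, here is a minimal sketch of the same loop. It assumes the standard library's xml.etree.ElementTree (where iterparse now lives) and a made-up flat layout of seg elements, since the real test.xml isn't shown in this thread. SAMPLE and iter_records are hypothetical names. Unlike the code above, it copies the id out of each element and clears it right away instead of handing the elements to process():

```python
import io
import xml.etree.ElementTree as ET

# hypothetical sample; the real layout of test.xml isn't shown in the thread
SAMPLE = b"""<doc>
<seg id="HL">a</seg>
<seg id="X1">b</seg>
<seg id="HL">c</seg>
<seg id="X2">d</seg>
<seg id="X3">e</seg>
</doc>"""

def iter_records(source):
    # yield one list of seg ids per HL group;
    # source may be a filename or a binary file object
    record = []
    for event, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag != "seg":
            continue  # skip the enclosing root element
        if elem.get("id") == "HL" and record:
            # a new HL starts a new record, so hand over the old one
            yield record
            record = []
        record.append(elem.get("id"))
        elem.clear()  # the id has been copied out, so the element can be emptied
    if record:
        # same thing at end of file
        yield record

records = list(iter_records(io.BytesIO(SAMPLE)))
print(records)
```

the "and record" test skips the empty record that would otherwise be emitted when the very first HL segment is seen.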

here's a "functional" version of the same thing, btw:

    import cElementTree as ET
    from itertools import groupby
    from operator import itemgetter

    def source(file):
        # assign a unique serial to each HL group
        serial = 0
        for event, elem in ET.iterparse(file):
            if elem.tag == "seg" and elem.get("id") == "HL":
                serial += 1
            yield serial, elem

    for dummy, record in groupby(source("test.xml"), itemgetter(0)):
        # process record
        for dummy, elem in record:
            print elem,
            elem.clear()
        print
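
a sketch of the groupby variant for a current Python, under the same assumptions as before: xml.etree.ElementTree from the standard library, and a hypothetical flat sample (SAMPLE) standing in for test.xml. It collects (tag, id) pairs instead of printing, so the grouping is easy to inspect:

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby
from operator import itemgetter

# hypothetical sample; the real layout of test.xml isn't shown in the thread
SAMPLE = b"""<doc>
<seg id="HL">a</seg>
<seg id="X1">b</seg>
<seg id="HL">c</seg>
<seg id="X2">d</seg>
<seg id="X3">e</seg>
</doc>"""

def source(file):
    # tag every element with the serial number of the HL group it belongs to
    serial = 0
    for event, elem in ET.iterparse(file):
        if elem.tag == "seg" and elem.get("id") == "HL":
            serial += 1
        yield serial, elem

groups = []
for serial, record in groupby(source(io.BytesIO(SAMPLE)), itemgetter(0)):
    items = []
    for _, elem in record:
        # read what we need before clearing; clear() also drops attributes
        items.append((elem.tag, elem.get("id")))
        elem.clear()
    groups.append((serial, items))
print(groups)
```

with this sample, the last group also contains the enclosing doc element, because its "end" event arrives after all the seg elements. If that matters for your data, filter on elem.tag inside source().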

</F> 




