stripping fields from xml file into a csv

Stefan Behnel stefan_ml at behnel.de
Sun Feb 28 03:05:11 EST 2010


Hal Styli, 27.02.2010 21:50:
> I have a sed solution to the problems below but would like to rewrite
> in python...

Note that sed (or any other line-based or text-based tool) is not a
sensible way to handle XML. If you want to read XML, use an XML parser.
Parsers are designed to do exactly what you want in a standards-compliant
way, and they cope with the variations in XML formatting and encoding that
would trip up a text-based tool.
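
To illustrate the point, here is a small sketch. The sample document and
its 'order' tag name are made up for illustration and are not taken from
your file. It shows a parser reading attributes regardless of line breaks,
attribute order, quoting style or character references, all of which tend
to break a line-based regex:

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: an attribute list split across lines, mixed quote
# styles, reordered attributes and an escaped '&' inside a value.
sample = """<orders>
  <order cust="dick"
         product='eggs &amp; bacon' quantity="12"/>
  <order quantity="2" product="milk" cust="tom"/>
</orders>"""

root = ET.fromstring(sample)

# Collect the three attributes of every <order> element, in a fixed order.
rows = [(o.get('cust'), o.get('product'), o.get('quantity'))
        for o in root.iter('order')]

for row in rows:
    print(row)
```

The parser hands back the decoded attribute values, so '&amp;' arrives as
a plain '&' without any extra work.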


> I need to strip out some data from a quirky xml file into a csv:
> 
> from something like this
> 
> < ..... cust="dick" .... product="eggs" ... quantity="12" .... >
> < .... cust="tom" .... product="milk" ... quantity="2" ...>
> < .... cust="harry" .... product="bread" ... quantity="1" ...>
> < .... cust="tom" .... product="eggs" ... quantity="6" ...>
> < ..... cust="dick" .... product="eggs" ... quantity="6" .... >

As others have noted, this doesn't tell us much about your XML. A more
complete example would be helpful.


> to this
> 
> dick,eggs,12
> tom,milk,2
> harry,bread,1
> tom,eggs,6
> dick,eggs,6
> 
> I am new to python and xml and it would be great to see some slick
> ways of achieving the above by using python's XML capabilities to
> parse the original file or python's regex to achieve what I did using
> sed.

It's funny how often people still think that SAX is a good way to solve XML
problems. Here's an untested solution that uses xml.etree.ElementTree:

    from xml.etree import ElementTree as ET

    csv_field_order = ['cust', 'product', 'quantity']

    clean_up_used_elements = None
    for event, element in ET.iterparse("thefile.xml", events=['start']):
        # you may want to select a specific element.tag here

        # format and print the CSV line to the standard output
        print(','.join(element.attrib.get(title, '')
                       for title in csv_field_order))

        # save some memory (in case the XML file is very large)
        if clean_up_used_elements is None:
            # this assigns the clear() method of the root (first) element
            clean_up_used_elements = element.clear
        clean_up_used_elements()

You can strip everything dealing with 'clean_up_used_elements' (basically
the last section) if your XML file is small enough to fit into memory (a
couple of MB is usually fine).
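
If the attribute values may themselves contain commas or quotes, a plain
','.join() produces broken CSV. The following is a sketch of one possible
variant that leaves the quoting to the csv module and only emits rows for
one element type. The 'order' tag name, the xml_to_csv() helper and the
sample data are assumptions for illustration, since I don't know what your
file really looks like:

```python
import csv
import io
from xml.etree import ElementTree as ET

csv_field_order = ['cust', 'product', 'quantity']

def xml_to_csv(source, out, tag='order'):
    """Write one CSV row per <tag> element found in source.

    'order' is an assumed tag name; use whatever your file contains.
    """
    writer = csv.writer(out, lineterminator='\n')
    for event, element in ET.iterparse(source, events=['start']):
        if element.tag != tag:
            continue  # skip the root and any unrelated elements
        writer.writerow([element.attrib.get(title, '')
                         for title in csv_field_order])

# Made-up sample data standing in for "thefile.xml"
sample = io.BytesIO(b'''<orders>
  <order cust="dick" product="eggs" quantity="12"/>
  <order cust="tom" product="milk, skimmed" quantity="2"/>
</orders>''')

out = io.StringIO()
xml_to_csv(sample, out)
print(out.getvalue(), end='')
```

For a real file, pass the file name as 'source' and sys.stdout (or an open
output file) as 'out'. The memory clean-up from above is left out here to
keep the sketch short.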

Stefan



