stripping fields from xml file into a csv

Hai Vu wuhrrr at gmail.com
Sun Feb 28 05:15:43 CET 2010


On Feb 27, 12:50 pm, Hal Styli <silly... at yahoo.com> wrote:
> Hello,
>
> Can someone please help.
> I have a sed solution to the problems below but would like to rewrite
> in python...
>
> I need to strip out some data from a quirky xml file into a csv:
>
> from something like this
>
> < ..... cust="dick" .... product="eggs" ... quantity="12" .... >
> < .... cust="tom" .... product="milk" ... quantity="2" ...>
> < .... cust="harry" .... product="bread" ... quantity="1" ...>
> < .... cust="tom" .... product="eggs" ... quantity="6" ...>
> < ..... cust="dick" .... product="eggs" ... quantity="6" .... >
>
> to this
>
> dick,eggs,12
> tom,milk,2
> harry,bread,1
> tom,eggs,6
> dick,eggs,6
>
> I am new to python and xml and it would be great to see some slick
> ways of achieving the above by using python's XML capabilities to
> parse the original file or python's regex to achieve what I did using
> sed.
>
> Thanks for any constructive help given.
>
> Hal

Here is a sample XML file (I named it data.xml):
--------------------------
<orders>
	<order customer="john" product="eggs" quantity="12" />
	<order customer="cindy" product="bread" quantity="1" />
	<order customer="larry" product="tea bags" quantity="100" />
	<order customer="john" product="butter" quantity="1" />
	<order product="chicken" quantity="2" customer="derek" />
</orders>
--------------------------

Code:
--------------------------
import csv
import xml.sax

# Handle the XML file with the following structure:
# <orders>
#   <order attributes... /> ...
# </orders>
class OrdersHandler(xml.sax.handler.ContentHandler):
    def __init__(self, csvfile):
        xml.sax.handler.ContentHandler.__init__(self)
        # Open a csv file for output and keep the handle so it can be
        # closed explicitly. newline='' is what the csv module expects
        # on Python 3 (on Python 2, open with mode 'wb' instead)
        self.csvFile = open(csvfile, 'w', newline='')
        self.csvWriter = csv.writer(self.csvFile)

    def startElement(self, name, attributes):
        # Only process the <order ... > element
        if name == 'order':
            # Sort the attribute names so that every row lists its
            # values in the same column order, whatever order the
            # attributes have in the XML. We assume all <order>
            # elements carry the same set of attributes
            attributeNames = sorted(attributes.getNames())

            # Construct a row and write it to the csv file
            row = [attributes.getValue(attrName)
                   for attrName in attributeNames]
            self.csvWriter.writerow(row)

    def endDocument(self):
        # Close the output file so that all rows are flushed to disk
        self.csvFile.close()

# Main
datafile = 'data.xml'
csvfile = 'data.csv'
ordersHandler = OrdersHandler(csvfile)
xml.sax.parse(datafile, ordersHandler)
--------------------------

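To solve your problem, it is easier to use SAX than DOM. SAX scans the
XML file and calls your handler once for each element. When the handler
sees the element you want (in this case <order ...>), it processes that
element's attributes. Here it sorts the attribute names, which for this
data gives the column order customer, product, quantity, and writes the
values as one row of the csv file.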
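
Sorting only gives the wanted column order because the attribute names
happen to be alphabetical. As a minimal sketch of the same SAX idea with
an explicit column list, assuming Python 3: the names COLUMNS,
FixedOrderHandler and orders_to_csv below are made up for illustration,
and the XML is parsed from a string so the example runs without any
files.

```python
import csv
import io
import xml.sax

# Assumed column order; the names match the sample data.xml above.
COLUMNS = ['customer', 'product', 'quantity']

class FixedOrderHandler(xml.sax.handler.ContentHandler):
    """Write one csv row per <order> element, columns in COLUMNS order."""

    def __init__(self, writer):
        xml.sax.handler.ContentHandler.__init__(self)
        self.writer = writer

    def startElement(self, name, attributes):
        if name == 'order':
            # get() falls back to '' when an attribute is missing,
            # so an incomplete element still yields a full-width row
            self.writer.writerow([attributes.get(c, '') for c in COLUMNS])

def orders_to_csv(xml_text):
    """Parse XML text and return the csv output as a string."""
    out = io.StringIO()
    handler = FixedOrderHandler(csv.writer(out, lineterminator='\n'))
    xml.sax.parseString(xml_text.encode('utf-8'), handler)
    return out.getvalue()

sample = '''<orders>
    <order customer="john" product="eggs" quantity="12" />
    <order product="chicken" quantity="2" customer="derek" />
</orders>'''

print(orders_to_csv(sample))
```

With an explicit list, renaming or adding attributes in the XML cannot
silently reorder the csv columns.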

--------------------------
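
The question also asked about a regex route. SAX needs well-formed XML,
so if the real file is too quirky to parse, something along these lines
might do instead. This is only a sketch, assuming Python 3, one element
per line, and double-quoted attribute values; the field names come from
the example in the question, while the sample lines and the helper
extract_row are invented for illustration.

```python
import csv
import re
import sys

# Attribute names as they appear in the question's example lines.
FIELDS = ['cust', 'product', 'quantity']

# Matches name="value" pairs; assumes values are double-quoted
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def extract_row(line):
    """Return [cust, product, quantity] for a line, or None if any is missing."""
    found = dict(ATTR_RE.findall(line))
    if all(field in found for field in FIELDS):
        return [found[field] for field in FIELDS]
    return None

# Made-up input lines standing in for the real file
lines = [
    '<order id="1" cust="dick" region="n" product="eggs" quantity="12" >',
    '<order cust="tom" product="milk" quantity="2" />',
    '<note text="no order here" />',
]

writer = csv.writer(sys.stdout, lineterminator='\n')
for line in lines:
    row = extract_row(line)
    if row is not None:
        writer.writerow(row)
```

Unlike the SAX version, this does not decode XML entities such as
&amp; in attribute values, so prefer the parser whenever the file is
well-formed.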

References:

SAX Parser:
    http://docs.python.org/library/xml.sax.html

SAX Content Handler:
    http://docs.python.org/library/xml.sax.handler.html

Attributes Object:
    http://docs.python.org/library/xml.sax.reader.html#attributes-objects



