A Unique XML Parsing Problem

Devon dshurick at gmail.com
Sat Oct 23 19:40:20 EDT 2010


I need to parse data quickly and efficiently out of many XML files so
that I can run some learning algorithms on it.
Some background:

I have thousands of files, each corresponding to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
The beginning of one of these files looks like this:

<analysis decoder="Quicktime" version="0x7608000">
    <track duration="29.12331" endOfFadeIn="0.00000"
startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
tempoConfidence="0.386" timeSignature="4"
timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
mode="0" modeConfidence="1.000">
        <sections>
            <section start="0.00000" duration="7.35887"/>
            <section start="7.35887" duration="13.03414"/>
            <section start="20.39301" duration="8.73030"/>
        </sections>
        <segments>
            <segment start="0.00000" duration="0.56000">
                <loudness>
                    <dB time="0">-60.000</dB>
                    <dB time="0.45279" type="max">-59.897</dB>
                </loudness>
                <pitches>
                    <pitch class="0">0.589</pitch>
                    <pitch class="1">0.446</pitch>
                    <pitch class="2">0.518</pitch>
                    <pitch class="3">1.000</pitch>
                    <pitch class="4">0.850</pitch>
                    <pitch class="5">0.414</pitch>
                    <pitch class="6">0.326</pitch>
                    <pitch class="7">0.304</pitch>
                    <pitch class="8">0.415</pitch>
                    <pitch class="9">0.566</pitch>
                    <pitch class="10">0.353</pitch>
                    <pitch class="11">0.350</pitch>

I am a statistician and therefore used to data being stored in CSV-like
files, with each row being a data point and each column being a
feature. I would like to parse the data out of these XML files and
write it out to a CSV file. Any help would be greatly appreciated;
mostly I am looking for a pointer in the right direction. I have heard
of Beautiful Soup but have never used it, and I am currently reading
the Dive Into Python chapters on HTML and XML parsing. My main concern
is how to use the tag and attribute names in the XML files to build
feature names, so that I do not have to hard-code them. For example,
the first feature given by the XML above would be "track duration",
with a value of 29.12331.
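
To make the naming idea concrete, here is a rough sketch of the kind of
thing I imagine, assuming the standard library's xml.etree.ElementTree
and csv modules are suitable for this. SAMPLE is a trimmed, closed-off
stand-in for one of my files, and track_features and write_csv are
names I made up for illustration. It only covers the attributes on
<track>, not the per-segment data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for one song file (assumption: real files are
# well-formed and have a single <track> under <analysis>).
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
    <track duration="29.12331" tempo="71.031" key="11">
        <sections>
            <section start="0.00000" duration="7.35887"/>
        </sections>
    </track>
</analysis>"""


def track_features(xml_text):
    """Map '<tag> <attribute>' feature names to the raw attribute strings."""
    root = ET.fromstring(xml_text)
    track = root.find("track")
    # Feature names come from the tag and attribute names, not hard-coded.
    return {track.tag + " " + name: value
            for name, value in track.attrib.items()}


def write_csv(rows, out):
    """Write one CSV row per song; columns are the union of feature names."""
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    # restval fills in features that a particular song is missing.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


features = track_features(SAMPLE)
buf = io.StringIO()
write_csv([features], buf)
print(buf.getvalue())
```

For the real data I would presumably read each file from disk instead
of using SAMPLE, collect one dictionary per song, and pass the whole
list to write_csv with an open file in place of the StringIO buffer.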

Thanks,

-Devon


