A Unique XML Parsing Problem

Piet van Oostrum piet at vanoostrum.org
Sun Oct 24 15:44:28 CEST 2010


Devon <dshurick at gmail.com> writes:

> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
> Info:
>
> I have thousands of files, each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> An example from the beginning of one of these files looks like:
>
> <analysis decoder="Quicktime" version="0x7608000">
>     <track duration="29.12331" endOfFadeIn="0.00000"
> startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
> tempoConfidence="0.386" timeSignature="4"
> timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
> mode="0" modeConfidence="1.000">
>         <sections>
>             <section start="0.00000" duration="7.35887"/>
>             <section start="7.35887" duration="13.03414"/>
>             <section start="20.39301" duration="8.73030"/>
>         </sections>
>         <segments>
>             <segment start="0.00000" duration="0.56000">
>                 <loudness>
>                     <dB time="0">-60.000</dB>
>                     <dB time="0.45279" type="max">-59.897</dB>
>                 </loudness>
>                 <pitches>
>                     <pitch class="0">0.589</pitch>
>                     <pitch class="1">0.446</pitch>
>                     <pitch class="2">0.518</pitch>
>                     <pitch class="3">1.000</pitch>
>                     <pitch class="4">0.850</pitch>
>                     <pitch class="5">0.414</pitch>
>                     <pitch class="6">0.326</pitch>
>                     <pitch class="7">0.304</pitch>
>                     <pitch class="8">0.415</pitch>
>                     <pitch class="9">0.566</pitch>
>                     <pitch class="10">0.353</pitch>
>                     <pitch class="11">0.350</pitch>
>

You could use XSLT to get the data. For example, this XSLT script extracts the duration, tempo and time signature of each track into a comma-separated list.

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML. -->
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <!-- One line per track: duration,tempo,timeSignature -->
  <xsl:template match="/analysis/track">
    <xsl:value-of select="concat(@duration, ',', @tempo, ',',
      @timeSignature)"/><xsl:text>&#x0A;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

With the script saved as song.xsl, running

    xsltproc song.xsl song*.xml

prints one line of output per file. No Python necessary. Or, if you would like to do this inside a Python program, use lxml to call the XSLT processor, or just use XPath to extract the values and format them with Python.
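
As a rough sketch of that last option, the following pulls the same three attributes out of each file and prints them as comma-separated lines. It is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) instead of lxml so that it runs without extra packages, it assumes each file has an <analysis> root with one <track> child as in the quoted sample, and the names track_features and write_csv are made up for this example.

```python
import csv
import sys
import xml.etree.ElementTree as ET

# Attribute names taken from the <track> element in the quoted sample.
FIELDS = ("duration", "tempo", "timeSignature")


def track_features(source, fields=FIELDS):
    """Return the requested <track> attributes from one analysis file.

    `source` is a filename or a file object. The root element is assumed
    to be <analysis> with a single <track> child.
    """
    root = ET.parse(source).getroot()
    track = root.find("track")  # limited XPath: a direct child of the root
    if track is None:
        raise ValueError("no <track> element found")
    # Missing attributes become empty strings instead of raising.
    return [track.get(name, "") for name in fields]


def write_csv(paths, out=sys.stdout):
    """Write one comma-separated line per input file to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for path in paths:
        writer.writerow(track_features(path))


if __name__ == "__main__":
    # Hypothetical usage: python extract.py song*.xml
    write_csv(sys.argv[1:])
```

With lxml installed, changing the import to "from lxml import etree as ET" should work for this sketch as well, since lxml.etree offers the same parse/find/get calls and adds full XPath through its xpath() method. The returned values are strings, so convert them with float() or int() before feeding them to a learning algorithm.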


