A Unique XML Parsing Problem
Piet van Oostrum
piet at vanoostrum.org
Sun Oct 24 09:44:28 EDT 2010
Devon <dshurick at gmail.com> writes:
> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
> Info:
>
> I have thousands of files, each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> An example from the beginning of one of these files looks like:
>
> <analysis decoder="Quicktime" version="0x7608000">
> <track duration="29.12331" endOfFadeIn="0.00000"
> startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
> tempoConfidence="0.386" timeSignature="4"
> timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
> mode="0" modeConfidence="1.000">
> <sections>
> <section start="0.00000" duration="7.35887"/>
> <section start="7.35887" duration="13.03414"/>
> <section start="20.39301" duration="8.73030"/>
> </sections>
> <segments>
> <segment start="0.00000" duration="0.56000">
> <loudness>
> <dB time="0">-60.000</dB>
> <dB time="0.45279" type="max">-59.897</dB>
> </loudness>
> <pitches>
> <pitch class="0">0.589</pitch>
> <pitch class="1">0.446</pitch>
> <pitch class="2">0.518</pitch>
> <pitch class="3">1.000</pitch>
> <pitch class="4">0.850</pitch>
> <pitch class="5">0.414</pitch>
> <pitch class="6">0.326</pitch>
> <pitch class="7">0.304</pitch>
> <pitch class="8">0.415</pitch>
> <pitch class="9">0.566</pitch>
> <pitch class="10">0.353</pitch>
> <pitch class="11">0.350</pitch>
>
You could use XSLT to get the data. For example, this XSLT script extracts the duration, tempo and time signature of each track as one comma-separated line:
<xsl:stylesheet version="1.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML -->
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <!-- One output line per track: duration,tempo,timeSignature -->
  <xsl:template match="/analysis/track">
    <xsl:value-of select="concat(@duration, ',', @tempo, ',', @timeSignature)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
If you save the script as song.xsl, running xsltproc song.xsl song*.xml gives you the output, one line per file.
No Python necessary. If you would rather do this inside a Python program, use lxml to call the XSLT processor, or just use XPath to extract the values and format them with Python.
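
A minimal sketch of the second option, extracting the values in Python and formatting them there. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of lxml, so it is a stand-in rather than the lxml approach itself. The function names and the abbreviated SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modelled on the file quoted above (illustrative data only).
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4">
    <sections>
      <section start="0.00000" duration="7.35887"/>
    </sections>
  </track>
</analysis>"""

# Attributes to pull from <track>, in output order.
FIELDS = ("duration", "tempo", "timeSignature")


def track_row(root):
    """Return 'duration,tempo,timeSignature' for the <track> child of root."""
    # root is the <analysis> element, so the path is relative to it.
    track = root.find("track")
    if track is None:
        raise ValueError("no <track> element found")
    # A missing attribute becomes an empty field instead of raising.
    return ",".join(track.get(name, "") for name in FIELDS)


def rows_from_files(paths):
    """Yield one comma-separated row per XML file path (hypothetical helper)."""
    for path in paths:
        yield track_row(ET.parse(path).getroot())


if __name__ == "__main__":
    print(track_row(ET.fromstring(SAMPLE)))
```

For the real data you would pass the list of song files to rows_from_files, for example the result of glob.glob("song*.xml"), and write the rows wherever the learning code expects them.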