[Pythonmac-SIG] XML Parsing

Bryan Smith bryanabsmith at gmail.com
Sat Feb 7 21:09:35 CET 2009


"Not clear from your question whether your goal is to learn to parse XML in
python or to solve a particular problem. If your goal is to learn python XML
processing, then go right ahead -- however, it looks like you are using SAX
below, and the sort of thing you describe might be done better using a DOM
parser ( or maybe etree )" - It's a bit of both: I'm learning XML parsing by
solving a problem. I started with SAX because that's how the book I have does
it.

I have looked up ElementTree and this looks like a much easier and much more
elegant solution to my problem.
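
For anyone following along, here is a minimal sketch of what the ElementTree
approach might look like. It is only an illustration under assumptions: the
SAMPLE string is a hand-trimmed imitation of the feed (the element names and
the rank attribute are taken from the stylesheet quoted below, the child order
is a guess), and top_albums and its limit parameter are names made up for this
example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for topalbums.xml. Element names follow the
# stylesheet quoted later in this thread; the real feed has more fields.
SAMPLE = """<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>"""


def top_albums(xml_text, limit=5):
    """Return (album, artist, playcount) tuples for albums ranked <= limit."""
    root = ET.fromstring(xml_text)
    albums = []
    for album in root.findall("album"):
        if int(album.get("rank")) > limit:
            continue
        # "name" is the album's own child; "artist/name" is the nested one,
        # so the two <name> tags can no longer be confused.
        albums.append((
            album.findtext("name"),
            album.findtext("artist/name"),
            int(album.findtext("playcount")),
        ))
    return albums
```

Because each lookup is relative to the <album> element, the path itself says
which <name> is meant, and no state flags are needed.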

"Not that it can't be done in SAX -- it's just that, as you discovered, low
level SAX parsing requires that you keep track of the containment hierarchy
yourself, which is a lot of work to solve a simple problem." - I see now that
I was doing a lot more work than I needed to accomplish my goal.
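
For the record, one way to keep track of the containment hierarchy in SAX is
to hold a stack of the currently open tag names and look at the parent when a
<name> element closes. The following is a rough sketch under assumptions, not
a recommended solution: TopAlbumsHandler, parse_top_albums and the SAMPLE
document are all made up for illustration, and the sample only imitates the
structure of the real feed.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical, trimmed stand-in for topalbums.xml (structure assumed).
SAMPLE = """<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>"""


class TopAlbumsHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.path = []       # stack of open tag names, outermost first
        self.text = []       # character data since the last start/end tag
        self.current = None  # dict for the album being read, if any
        self.albums = []     # finished album dicts

    def startElement(self, tag_name, attrs):
        self.path.append(tag_name)
        self.text = []
        if tag_name == "album":
            self.current = {}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, tag_name):
        content = "".join(self.text).strip()
        parent = self.path[-2] if len(self.path) > 1 else None
        if self.current is not None:
            # The parent on the stack tells the two <name> tags apart.
            if tag_name == "name" and parent == "album":
                self.current["album"] = content
            elif tag_name == "name" and parent == "artist":
                self.current["artist"] = content
            elif tag_name == "playcount":
                self.current["playcount"] = int(content)
            elif tag_name == "album":
                self.albums.append(self.current)
                self.current = None
        self.path.pop()
        self.text = []


def parse_top_albums(xml_text):
    """Parse the XML string and return a list of album dicts."""
    handler = TopAlbumsHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.albums
```

The stack is the extra bookkeeping being talked about above; DOM or etree do
the equivalent work for you.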

Thanks a lot, Steve, for the in-depth (from my perspective) explanation of all
the solutions available to me. I appreciate the help.

Bryan

On Sat, Feb 7, 2009 at 2:18 AM, Steve Majewski <sdm7g at mac.com> wrote:

>
> Not clear from your question whether your goal is to learn to parse XML in
> python or to solve a particular problem. If your goal is to learn python XML
> processing, then go right ahead -- however, it looks like you are using SAX
> below, and the sort of thing you describe might be done better using a DOM
> parser ( or maybe etree )
>
> If what you want is not just to select some info from the xml file, but to
> get it into a Python object so that you can then manipulate it further, then
> DOM or etree is also probably a better model. It will parse the XML ( likely
> using SAX underneath ) and give you an object that encodes the whole file.
>
> [ Not that it can't be done in SAX -- it's just that, as you discovered, low
> level SAX parsing requires that you keep track of the containment hierarchy
> yourself, which is a lot of work to solve a simple problem. ]
>
>
> If you're just trying to work with XML, then most folks don't write XML
> parsers for that sort of thing, but use higher level tools: XSLT, XPath
> and/or XQuery.
>
> The Mac has xsltproc as a built-in xslt (1.0) processor.
> There is an xpath program written in perl in Leopard/10.5 ( /usr/bin/xpath ).
> And Saxon is easily downloaded and does xslt 2.0 and xquery 1.0 .
>
>
> The following XSLT 1.0 stylesheet:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="1.0">
> <xsl:output method="text"/>
>
> <xsl:template match="/">
>    <xsl:apply-templates select="/topalbums/album[@rank &lt; 6]"/>
>    <!-- just select the top 5 albums -->
> </xsl:template>
>
> <xsl:template match="/topalbums/album" >
>  album: <xsl:value-of select="name"/>
>  artist: <xsl:value-of select="artist/name"/>
>  count=<xsl:value-of select="playcount"/>
>  <xsl:text>
>  </xsl:text> <!-- this is here to insert the blank line break -->
> </xsl:template>
>
> </xsl:stylesheet>
>
>
> Will, when run on that file, produce this output:
> ~$ xsltproc Untitled1.xsl  topalbums.xml
>
>  album: Vheissu
>  artist: Thrice
>  count=332
>
>  album: The Artist in the Ambulance
>  artist: Thrice
>  count=289
>
>  album: Appeal To Reason
>  artist: Rise Against
>  count=286
>
>  album: Favourite Worst Nightmare
>  artist: Arctic Monkeys
>  count=210
>
>  album: The Sufferer & The Witness
>  artist: Rise Against
>  count=206
>
> [ Not sure if that's anything like what you want. ]
>
>
> I'm sure that the whole thing would reduce to an even more concise XQuery
> request.
>
> I was trying to do the whole thing as an xpath one liner, but it didn't like
> my attempts to include alternates in parentheses. I think this is an xpath
> 1.0 vs. xpath 2.0 issue. Saxon is the only thing that supports 2.0. The perl,
> python and java libraries only support xpath 1.0.
>
> This sort of expression did work using xpath 2.0 (in oxygen editor):
>
>        //album[@rank < 6]/(name|playcount|artist/name)
>
> But I couldn't figure out a 1.0 syntax that would grab all three fields.
>
> ( and the perl xpath seems to have a bug that interprets '@rank < 6' as
> less-than-or-equal! )
>
>
> -- Steve Majewski
>
>
>
>
> On Feb 6, 2009, at 11:00 PM, Bryan Smith wrote:
>
>> Hi everyone,
>>
>> I have another question I'm hoping someone would be kind enough to answer.
>> I am new to parsing XML (not to mention much of Python itself) and I am
>> trying to parse an XML file. The file I am trying to parse is this one:
>> http://ws.audioscrobbler.com/2.0/user/bryansmith/topalbums.xml.
>>
>> So far, I have written a class for parsing this file, in an attempt to
>> present to the user a list of top albums on their last.fm profile. Note
>> that the artist name and album name are both marked by the <name> tag,
>> which makes my job harder. If the tag names were different, I wouldn't have
>> a problem. Listed below is the class I have written to parse the file. My
>> question then is this: is there a way I can say something like "if tag_name
>> == album name tag then....elif tag_name == artist name tag....". I hope this
>> is clear.
>>
>> As it stands right now, if I parse this file and print the results in the
>> form "album (playcount)", this is what I get (understandably): Vheissu
>> (332), Thrice (289), The Artist in the Ambulance (286), Thrice (210) and so
>> on. Thrice is the artist name. I want to be able to differentiate between
>> the "artist" name tag and the "album" name tag.
>>
>>
>> Class as it stands right now:
>>
>> from xml.sax.handler import ContentHandler
>>
>> class GetTopAlbums(ContentHandler):
>>
>>    in_album_tag = False
>>    in_playcount_tag = False
>>
>>    def __init__(self, album, playcount):
>>        ContentHandler.__init__(self)
>>        self.album = album
>>        self.playcount = playcount
>>        self.data = []
>>
>>    def startElement(self, tag_name, attr):
>>        if tag_name == "name":
>>            self.in_album_tag = True
>>        elif tag_name == "playcount":
>>            self.in_playcount_tag = True
>>
>>    def endElement(self, tag_name):
>>        if tag_name == "name":
>>            content = "".join(self.data)
>>            self.data = []
>>            self.album.append(content)
>>            self.in_album_tag = False
>>        elif tag_name == "playcount":
>>            content = "".join(self.data)
>>            self.data = []
>>            self.playcount.append(content)
>>            self.in_playcount_tag = False
>>
>>    def characters(self, string):
>>        if self.in_album_tag == True:
>>            self.data.append(string)
>>        elif self.in_playcount_tag == True:
>>            self.data.append(string)
>>
>> Thanks in advance!
>> Bryan