What's the best way to parse this HTML tag?
John Salerno
johnjsal at gmail.com
Sun Mar 11 18:53:47 EDT 2012
I'm using Beautiful Soup to extract some song information from a radio
station's website that lists the songs it plays as it plays them.
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:
<div class="cmPlaylistContent">
<strong>
<a href="/lsp/t2995/">
Love Without End, Amen
</a>
</strong>
<br/>
<a href="/lsp/a436/">
George Strait
</a>
<br/>
<span class="sprite iconDownload">
</span>
Download Song:
<a href="http://itunes.apple.com/us/album/love-without-end-amen/id71416?i=71404&uo=4">
iTunes
</a>
|
<a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?SubscriptionId=1NXYFBZST44V8CCDK182&tag=coxradiointer-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B000V638BQ">
Amazon MP3
</a>
<br/>
<span class="sprite iconComments">
Comments (1)
</span>
<span class="sprite iconVoteUp">
Votes (1)
</span>
</div>
This is about as far as I can drill down without getting TOO specific.
I simply find the <div> tags with the "cmPlaylistContent" class. This
tag contains both the song title and the artist name, and sometimes
miscellaneous other information as well, like a way to vote for the
song or links to purchase it from iTunes or Amazon.
So my question is, given the above HTML, how can I best extract the
song title and artist name? It SEEMS like they are always the first
two pieces of information in the tag, such that:
for item in div.stripped_strings: print(item)
Love Without End, Amen
George Strait
Download Song:
iTunes
|
Amazon MP3
Comments (1)
Votes (1)
and I could simply get the first two items returned by that generator.
It's not quite as clean as I'd like, because I have no idea if
anything could ever be inserted before either of these items, thus
messing it all up.
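To make the "first two items" idea concrete, here is a minimal sketch. The helper name first_two_strings is hypothetical, and the list below only stands in for div.stripped_strings using the sample output above. It still assumes the title and artist always come first, which is exactly the part I'm unsure about; the only safeguard is that it fails loudly when fewer than two strings are present.

```python
from itertools import islice


def first_two_strings(strings):
    """Return (title, artist) taken from the first two items of an iterable.

    Assumes the title comes first and the artist second, as in the sample.
    """
    # islice pulls at most two items and leaves the rest of the generator alone.
    head = list(islice(strings, 2))
    if len(head) != 2:
        raise ValueError("expected at least two strings, got %d" % len(head))
    title, artist = head
    return title, artist


# Stand-in for div.stripped_strings, copied from the sample output.
sample = iter(["Love Without End, Amen", "George Strait", "Download Song:", "iTunes"])
print(first_two_strings(sample))
```

With Beautiful Soup this would be called as first_two_strings(div.stripped_strings).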
I also don't want to rely on the <strong> tag, which makes me shudder,
or the <a> tag, because I don't know if they will always have an href.
Ideally, the <a> tag would have also had an attribute that labeled the
title as the title, and the artist as the artist, but alas.....
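One alternative I can sketch, with heavy caveats: in the single sample above, the title link points at /lsp/t2995/ and the artist link at /lsp/a436/. If that "t" versus "a" pattern holds in general (a guess from one sample, nothing more), the href itself could serve as the missing label. The names LINK_KIND and classify_links are hypothetical, and links without an href are simply skipped.

```python
import re

# Assumption based only on the one sample: title links look like
# /lsp/t<digits>/ and artist links look like /lsp/a<digits>/.
LINK_KIND = re.compile(r"^/lsp/([ta])\d+/$")


def classify_links(links):
    """Pick the title and artist out of (href, text) pairs.

    href may be None or empty for an <a> tag that has no href attribute.
    """
    found = {"title": None, "artist": None}
    for href, text in links:
        match = LINK_KIND.match(href or "")
        if not match:
            continue  # purchase links, missing hrefs, anything unexpected
        kind = "title" if match.group(1) == "t" else "artist"
        if found[kind] is None:  # keep only the first match of each kind
            found[kind] = text
    return found
```

With Beautiful Soup, the pairs could be built as [(a.get("href"), a.get_text(strip=True)) for a in div.find_all("a")]. A None value in the result would mean the pattern did not match, which at least makes a layout change visible instead of silently returning the wrong strings.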
Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?
Thanks!