What's the best way to parse this HTML tag?

John Salerno johnjsal at gmail.com
Sun Mar 11 18:53:47 EDT 2012


I'm using Beautiful Soup to extract some song information from a radio
station's website that lists the songs it plays as it plays them.
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:

<div class="cmPlaylistContent">
 <strong>
  <a href="/lsp/t2995/">
   Love Without End, Amen
  </a>
 </strong>
 <br/>
 <a href="/lsp/a436/">
  George Strait
 </a>
 <br/>
 <span class="sprite iconDownload">
 </span>
 Download Song:
 <a href="http://itunes.apple.com/us/album/love-without-end-amen/id71416?i=71404&uo=4">
  iTunes
 </a>
 |
 <a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?SubscriptionId=1NXYFBZST44V8CCDK182&tag=coxradiointer-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B000V638BQ">
  Amazon MP3
 </a>
 <br/>
 <span class="sprite iconComments">
  Comments  (1)
 </span>
 <span class="sprite iconVoteUp">
  Votes  (1)
 </span>
</div>

This is about as far as I can drill down without getting TOO specific.
I simply find the <div> tags with the "cmPlaylistContent" class. This
tag contains both the song title and the artist name, and sometimes
miscellaneous other information as well, like a way to vote for the
song or links to purchase it from iTunes or Amazon.

So my question is, given the above HTML, how can I best extract the
song title and artist name? It SEEMS like they are always the first
two pieces of information in the tag, such that:

for item in div.stripped_strings: print(item)

Love Without End, Amen
George Strait
Download Song:
iTunes
|
Amazon MP3
Comments  (1)
Votes  (1)

and I could simply get the first two items returned by that generator.
It's not quite as clean as I'd like, because I have no idea if
anything could ever be inserted before either of these items, thus
messing it all up.
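
A minimal sketch of that "first two items" idea, with a guard so that a
shorter-than-expected tag fails loudly instead of silently mislabeling
things. The helper name first_two is made up for illustration, and the
list below only stands in for div.stripped_strings so the snippet runs
on its own:

```python
from itertools import islice


def first_two(strings):
    """Return (title, artist) from an iterable of strings.

    `strings` can be any iterable, e.g. div.stripped_strings.
    Raises ValueError if fewer than two strings are available.
    """
    items = list(islice(strings, 2))  # consume at most two items
    if len(items) < 2:
        raise ValueError("expected at least two strings, got %d" % len(items))
    title, artist = items
    return title, artist


# Stand-in for div.stripped_strings, using the values from the sample above.
sample = iter([
    "Love Without End, Amen",
    "George Strait",
    "Download Song:",
    "iTunes",
    "|",
    "Amazon MP3",
    "Comments  (1)",
    "Votes  (1)",
])

title, artist = first_two(sample)
print(title)
print(artist)
```

This still assumes the title and artist always come first; it only
detects the case where there are too few strings, not the case where
something else has been inserted ahead of them.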

I also don't want to rely on the <strong> tag, which makes me shudder,
or the <a> tag, because I don't know if they will always have an href.
Ideally, the <a> tag would also have had an attribute that labeled the
title as the title and the artist as the artist, but alas...
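
One possible fallback, sketched here purely as an assumption: in the
sample, the title link points at "/lsp/t2995/" and the artist link at
"/lsp/a436/". If (and only if) that "t" versus "a" prefix holds across
the whole site, the href itself could serve as the missing label. The
function names and the regular expression are hypothetical, guessed
from this single sample, and links with no href are simply skipped:

```python
import re

# Assumed patterns, guessed from the one sample above: "/lsp/t<digits>/"
# for a title link and "/lsp/a<digits>/" for an artist link.
_KINDS = {"t": "title", "a": "artist"}
_LSP_HREF = re.compile(r"^/lsp/([ta])\d+/$")


def classify_href(href):
    """Return "title", "artist", or None for a link's href (which may be None)."""
    if not href:
        return None
    match = _LSP_HREF.match(href)
    return _KINDS[match.group(1)] if match else None


def title_and_artist(links):
    """Pick the first title and first artist out of (href, text) pairs.

    With Beautiful Soup the pairs could be built as:
        ((a.get("href"), a.get_text(strip=True)) for a in div.find_all("a"))
    Either element of the result is None if no matching link was found.
    """
    found = {}
    for href, text in links:
        kind = classify_href(href)
        if kind and kind not in found:
            found[kind] = text
    return found.get("title"), found.get("artist")


# (href, text) pairs taken from the sample above, plus one link with no href.
sample_links = [
    ("/lsp/t2995/", "Love Without End, Amen"),
    ("/lsp/a436/", "George Strait"),
    ("http://itunes.apple.com/us/album/love-without-end-amen/id71416?i=71404&uo=4",
     "iTunes"),
    (None, "no href here"),
]

print(title_and_artist(sample_links))
```

Unlike taking the first two strings, this does not depend on position,
but it trades that for a dependence on the URL scheme, which is exactly
the kind of thing a site redesign could change.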

Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?

Thanks!


