What's the best way to parse this HTML tag?

John Salerno johnjsal at gmail.com
Mon Mar 12 03:35:50 CET 2012


On Mar 11, 7:28 pm, Roy Smith <r... at panix.com> wrote:
> In article
> <239c4ad7-ac93-45c5-98d6-71a434e1c... at r21g2000yqa.googlegroups.com>,
>  John Salerno <johnj... at gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > Getting the time that the song is played is easy, because the time is
> > wrapped in a <div> tag all by itself with a class attribute that has a
> > specific value I can search for. But the actual song title and artist
> > information is harder, because the HTML isn't quite as precise. Here's
> > a sample:
>
> > <div class="cmPlaylistContent">
> >  <strong>
> >   <a href="/lsp/t2995/">
> >    Love Without End, Amen
> >   </a>
> >  </strong>
> >  <br/>
> >  <a href="/lsp/a436/">
> >   George Strait
> >  </a>
> > [...]
> > Therefore, I appeal to your greater wisdom in these matters. Given
> > this HTML, is there a "best practice" for how to refer to the song
> > title and artist?
>
> Obviously, any attempt at screen scraping is fraught with peril.
> Beautiful Soup is a great tool but it doesn't negate the fact that
> you've made a pact with the devil.  That being said, if I had to guess,
> here's your puppy:
>
> >   <a href="/lsp/t2995/">
> >    Love Without End, Amen
> >   </a>
>
> the thing to look for is an "a" element with an href that starts with
> "/lsp/t", where "t" is for "track".  Likewise:
>
> >  <a href="/lsp/a436/">
> >   George Strait
> >  </a>
>
> an href starting with "/lsp/a" is probably an artist link.
>
> You owe the Oracle three helpings of tag soup.

Well, I had considered exactly that method, but I don't know for sure
if the titles and names will always have links like that, so I didn't
want to tie my programming to something so specific. But perhaps it's
still better than just taking the first two strings.



More information about the Python-list mailing list