What's the best way to write this regular expression?

John Salerno johnjsal at gmail.com
Tue Mar 6 17:43:34 EST 2012


I sort of have to work with what the website gives me (as you'll see below), but today I ran into a case that my RE doesn't handle. Let me just give all the specific information first. The point of my script is to go to a given URL and extract song information from it.

This is my RE:

song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)

This is how the website is formatted:

4:25 p.m.
                </div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">
                
                4:21 p.m.
                </div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li><li ><div class="cmPlaylistTime">
                
                4:19 p.m.
                </div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp100/p109/p10901ruw7x_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Moe Bandy" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t15101/">It' A Cheating Situation</a></strong><br /><a href="/lsp/a5307/">Moe Bandy</a><br /><span class="sprite iconVoteUp">Votes  (1) </span></div></li><li ><div class="cmPlaylistTime">
                
                4:15 p.m.
                </div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp700/p744/p74493d85qy_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Reba McEntire" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t14437/">Somebody Should Leave</a></strong><br /><a href="/lsp/a396/">REBA McENTIRE</a> & <a href="/lsp/a5765/">LINDA DAVIS</a><br /></div></li><li ><div class="cmPlaylistTime">

There's something of a pattern, although it's not always perfect. The time is listed first, followed by the song information in <a> tags. In this particular case, however, the 4:25 p.m. entry has only one <a> tag. "AP TX SOC CPAS TRF" is extracted as the song title, and then the RE skips ahead into the next entry to find a second <a> tag, which is actually the title of the next song in the list instead of the artist as normal. (Of course, I have no idea what AP TX SOC CPAS TRF is anyway. Usually the website doesn't list commercials or anomalies like that.)
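To make the failure concrete, here is a small reproduction. It is only a sketch: `sample` is a cut-down stand-in for the page (the image divs are left out), and the time part of the pattern is written as `[ap]\.m\.`, since `[a|p]` also admits a literal `|` and an unescaped `.` matches any character. That tightening does not change the behaviour shown.

```python
import re

# Same structure as the pattern in this post: a time, then two <a> tags,
# with lazy ".*?" gaps that are free to run across entry boundaries.
song_pattern = re.compile(
    r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>',
    re.DOTALL)

# Cut-down stand-in for the page markup (assumption: image divs omitted).
sample = """
4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">

4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li><li ><div class="cmPlaylistTime">

4:19 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t15101/">It' A Cheating Situation</a></strong><br /><a href="/lsp/a5307/">Moe Bandy</a><br /></div></li><li ><div class="cmPlaylistTime">
"""

matches = song_pattern.findall(sample)
for time, title, artist in matches:
    print(time, '|', title, '|', artist)
# 4:25 p.m. | AP TX SOC CPAS TRF | No One Else On Earth
# 4:19 p.m. | It' A Cheating Situation | Moe Bandy
```

Note the side effect: because the first match runs past "4:21 p.m." to find its second <a> tag, the 4:21 entry is consumed and never reported on its own.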

So my first question is basic: am I even extracting the information properly? It works almost all the time, but because the website is such a mess, I pretty much have to rely on the tags being in the proper places (as they were NOT in this case!).

The second question is, to fix the above problem, would it be sufficient to rewrite my RE so that it has to find all of the specified information, i.e. a time followed by two <a> entries, BEFORE it moves on to finding the next time? I think that would have caused it to skip the 4:25 entry above, and only extract entries that have a time followed by two <a> entries (song and artist).

If this is possible, how do I rewrite it so that it has to match all the conditions without skipping over the next time entry in order to do so?
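One possible way to express "both <a> tags must come before the next time" is a negative lookahead on each gap, so the gap can never step onto the start of another time stamp. This is a minimal sketch rather than a definitive answer: the names `TIME`, `GAP`, and `strict_pattern` are made up for illustration, `sample` is a cut-down stand-in for the page, and the link text is assumed never to contain a `<`.

```python
import re

TIME = r'[0-9]{1,2}:[0-9]{2} [ap]\.m\.'

# A gap is any run of characters in which no new time stamp begins.
GAP = r'(?:(?!' + TIME + r').)*?'

# Link text is captured with [^<]* so a capture cannot stretch across
# tags (and therefore cannot stretch into the next entry either).
strict_pattern = re.compile(
    r'(' + TIME + r')'
    + GAP + r'<a\b[^>]*>([^<]*)</a>'
    + GAP + r'<a\b[^>]*>([^<]*)</a>',
    re.DOTALL)

# Cut-down stand-in for the page markup (assumption: image divs omitted).
sample = """
4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">

4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li><li ><div class="cmPlaylistTime">

4:19 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t15101/">It' A Cheating Situation</a></strong><br /><a href="/lsp/a5307/">Moe Bandy</a><br /></div></li><li ><div class="cmPlaylistTime">
"""

strict_matches = strict_pattern.findall(sample)
for time, title, artist in strict_matches:
    print(time, '|', title, '|', artist)
# 4:21 p.m. | No One Else On Earth | Wynonna
# 4:19 p.m. | It' A Cheating Situation | Moe Bandy
```

With this version the 4:25 entry fails to match and is skipped, and the 4:21 entry is reported correctly. An entry with two artist links, like the 4:15 one, would still yield only the first artist, because the pattern captures exactly two <a> tags.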

Thanks.


