What's the best way to write this regular expression?

John Salerno johnjsal at gmail.com
Wed Mar 7 00:02:42 CET 2012


On Tuesday, March 6, 2012 4:52:10 PM UTC-6, Chris Rebert wrote:
> On Tue, Mar 6, 2012 at 2:43 PM, John Salerno <johnjsal at gmail.com> wrote:
> > I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
> >
> > This is my RE:
> >
> > song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)
> 
> I would advise against using regular expressions to "parse" HTML:
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
> 
> lxml is a popular choice for parsing HTML in Python: http://lxml.de
> 
> Cheers,
> Chris

Thanks, that was an interesting read :)

Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)



More information about the Python-list mailing list