Regular expression to structure HTML

Stefan Behnel stefan_ml at behnel.de
Fri Oct 2 14:32:56 CEST 2009


Paul McGuire wrote:
> On Oct 2, 12:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
>> I'm kind of new to regular expressions, and I've spent hours trying to
>> finesse a regular expression to build a substitution.
>>
>> What I'd like to do is extract data elements from HTML and structure
>> them so that they can more readily be imported into a database.
> 
> Oy! If I had a nickel for every misguided coder who tried to scrape
> HTML with regexes...
> 
> Some reasons why REs are no good at parsing HTML:
> - tags can be mixed case
> - tags can have whitespace in many unexpected places
> - tags with no body can combine opening and closing tag with a '/'
> before the closing '>', as in "<BR/>"
> - tags can have attributes that you did not expect (like "<BR
> CLEAR=ALL>")
> - attributes can occur in any order within the tag
> - attribute names can also be in unexpected upper/lower case
> - attribute values can be enclosed in double quotes, single quotes, or
> even (surprise!) NO quotes
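
To make a few of those points concrete, here is a minimal sketch that
compares a naive pattern with the standard library's html.parser. The
sample tags, the NAIVE pattern and the HrefCollector class are made up for
illustration; none of them come from the OP's code.

```python
import re
from html.parser import HTMLParser

# Hypothetical inputs: the same link written four ways (plain, mixed case
# with single quotes, extra attribute with unquoted value, odd whitespace).
SAMPLES = [
    '<a href="x.html">',
    "<A HREF='x.html'>",
    '<a class="nav" href=x.html>',
    '<a\n   href="x.html"\n>',
]

# The kind of pattern one tends to write first (an assumption, not the
# OP's actual regex).
NAIVE = re.compile(r'<a href="([^"]*)">')


class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case and
        # strips the quotes from attribute values.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def regex_hrefs(html):
    """Return href values found by the naive pattern."""
    return NAIVE.findall(html)


def parser_hrefs(html):
    """Return href values found by the HTML parser."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Run over SAMPLES, regex_hrefs() only finds the link in the first entry,
while parser_hrefs() returns it for all four.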

BTW, BeautifulSoup's parser also uses regexes, so if the OP used it, he/she
could claim to have solved the problem "with regular expressions" without
even lying.

Stefan


