Regular expression to structure HTML

Paul McGuire ptmcg at austin.rr.com
Fri Oct 2 04:22:24 EDT 2009


On Oct 2, 12:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.

Oy! If I had a nickel for every misguided coder who tried to scrape
HTML with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/'
before the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR
CLEAR=ALL>")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
even (surprise!) NO quotes

For HTML that is machine-generated, you *may* be able to make some
page-specific assumptions.  But if edited by human hands, or if you
are trying to make a generic page scraper, RE's will never cut it.

-- Paul




More information about the Python-list mailing list