Regular expression to structure HTML
ptmcg at austin.rr.com
Fri Oct 2 10:22:24 CEST 2009
On Oct 2, 12:10 am, "504cr... at gmail.com" <504cr... at gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
Oy! If I had a nickel for every misguided coder who tried to scrape
HTML with regexes...
Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/'
before the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
even (surprise!) NO quotes
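To make the list above concrete, here is a minimal sketch using Python's standard-library html.parser. This is not something the original post used; the sample tags, the NAIVE_IMG pattern, and the helper names are all made up for illustration. A regex written for one spelling of a tag misses the other legal spellings, while a parser normalizes case, whitespace, attribute order, and quoting:

```python
import re
from html.parser import HTMLParser

# Hypothetical samples: the same image tag written four legal ways.
SAMPLES = [
    '<img src="a.png" alt="A">',      # the form the regex author expected
    '<IMG SRC="a.png" ALT="A">',      # upper-case tag and attribute names
    "<img alt='A' src='a.png' />",    # reordered, single quotes, trailing '/'
    '<img\n   src=a.png alt=A>',      # odd whitespace, no quotes at all
]

# A naive pattern that assumes lowercase, double quotes, and src first.
NAIVE_IMG = re.compile(r'<img src="([^"]*)"')


class SrcCollector(HTMLParser):
    """Collect the src attribute of every img tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lower-cased names and unquoted values.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def regex_hits(samples):
    """Return the samples that the naive regex manages to match."""
    return [s for s in samples if NAIVE_IMG.search(s)]


def parser_sources(samples):
    """Return the src value the parser finds in each sample."""
    collector = SrcCollector()
    for s in samples:
        collector.feed(s)
    collector.close()
    return collector.sources
```

With these samples, regex_hits(SAMPLES) returns only the first string, while parser_sources(SAMPLES) returns 'a.png' for all four.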
For HTML that is machine-generated, you *may* be able to make some
page-specific assumptions. But if edited by human hands, or if you
are trying to make a generic page scraper, RE's will never cut it.
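As a rough sketch of what a page-specific assumption might look like: suppose (hypothetically) one generator always emits table rows in exactly the shape shown in PAGE below. The markup, the ROW pattern, and extract_rows are invented for illustration. A narrow regex can then pull out the fields for a database import, but it silently returns nothing as soon as the markup deviates:

```python
import re

# Hypothetical machine-generated markup: every row has exactly this shape.
PAGE = (
    "<tr><td>widget</td><td>4.50</td></tr>\n"
    "<tr><td>gadget</td><td>12.00</td></tr>\n"
)

# Assumes lowercase tags, no attributes, no whitespace inside the row.
ROW = re.compile(r"<tr><td>([^<]*)</td><td>([^<]*)</td></tr>")


def extract_rows(page):
    """Return a list of (name, price) tuples, one per matching row."""
    return ROW.findall(page)
```

Here extract_rows(PAGE) yields two (name, price) tuples, but the same row written with upper-case tags, as in '<TR><TD>widget</TD><TD>4.50</TD></TR>', yields an empty list with no error, which is exactly the failure mode to watch for.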