Regular expression to structure HTML

Nobody nobody at nowhere.com
Sun Oct 4 21:32:37 EDT 2009


On Thu, 01 Oct 2009 22:10:55 -0700, 504crank at gmail.com wrote:

> I'm kind of new to regular expressions

The most important thing to learn about regular expressions is what they
can do, what they can't do, and what they can do in theory but not in
practice (usually because matching time grows exponentially or
combinatorially with the input).

One thing they can't do is to match any kind of construct which has
arbitrary nesting. E.g. you can't match any class of HTML element which
can self-nest or whose children can self-nest. In practice, this means you
can only match a handful of elements which are either empty (e.g. <img>)
or which can only contain CDATA (e.g. <script>, <style>).
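
A small sketch of the nesting problem, using Python's re module. The
HTML snippets are made up for illustration. Neither the lazy nor the
greedy quantifier can pair each opening tag with its own closing tag; an
element that cannot nest, such as <script>, is fine.

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><p>x</p><div>b</div>"

# Lazy: stops at the first </div>, so the outer element is cut short.
lazy = re.search(r"<div>.*?</div>", nested, re.S).group(0)

# Greedy: runs to the last </div>, swallowing the <p> between two divs.
greedy = re.search(r"<div>.*</div>", siblings, re.S).group(0)

# <script> cannot contain another <script>, so a lazy match is enough.
script = re.search(
    r"<script>(.*?)</script>", "<script>var x = 1;</script>", re.S
).group(1)

print(lazy)    # <div>outer <div>inner</div>
print(greedy)  # <div>a</div><p>x</p><div>b</div>
print(script)  # var x = 1;
```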

You can match individual tags, although getting it right is quite hard;
simply using <[^>]*> fails if any of the attribute values contain a >
character.
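
To illustrate, here is the naive pattern next to a quote-aware one. The
second pattern is only a sketch: it assumes attribute values are
properly quoted, and it makes no attempt to handle comments, doctypes or
malformed markup. The sample tag is invented.

```python
import re

tag = '<a title="x > y" href="/home">'

# Naive: stops at the first ">", even inside a quoted attribute value.
naive = re.compile(r"<[^>]*>")

# Quote-aware: the tag body is a run of ordinary characters or complete
# single- or double-quoted strings, so a ">" inside quotes is skipped.
careful = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

print(naive.match(tag).group(0))    # <a title="x >
print(careful.match(tag).group(0))  # <a title="x > y" href="/home">
```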

> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.

If you want to extract entire elements from arbitrary HTML, you have to
use a real parser which can handle recursion, e.g. a recursive-descent
parser or a push-down automaton.
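
As a rough sketch of the parser approach, the following uses the
html.parser module from the Python 3 standard library and keeps an
explicit stack, which is what lets it cope with nesting. The class name
ElementText and the helper element_texts are made up for this example,
and it collects only the text of each element, not its markup.

```python
from html.parser import HTMLParser


class ElementText(HTMLParser):
    """Collect the text of every <tag> element, nested ones included."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.buffers = []  # stack: one text buffer per currently open <tag>
        self.found = []    # finished elements, in order of their end tags

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.buffers.append([])

    def handle_endtag(self, tag):
        if tag == self.tag and self.buffers:
            text = "".join(self.buffers.pop())
            self.found.append(text)
            if self.buffers:
                # The inner element's text is part of its parent too.
                self.buffers[-1].append(text)

    def handle_data(self, data):
        if self.buffers:
            self.buffers[-1].append(data)


def element_texts(html, tag):
    parser = ElementText(tag)
    parser.feed(html)
    parser.close()
    return parser.found


print(element_texts("<div>outer <div>inner</div> tail</div>", "div"))
# ['inner', 'outer inner tail']
```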

You can use regexps to match individual tags. If you only need to parse a
very specific subset of HTML (e.g. the pages are all generated from a
common template), you may even be able to match some entire elements using
regexps.
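
For instance, suppose every page is produced by one template that emits
table rows in exactly the form below. The template, the class names and
the data are all hypothetical. In that case a regexp can pull out
tuples ready for a database insert, but it will silently miss rows as
soon as the template changes.

```python
import re

# Matches one row exactly as the assumed template writes it.
ROW = re.compile(
    r'<tr><td class="name">([^<]*)</td><td class="price">([^<]*)</td></tr>'
)

page = (
    "<table>"
    '<tr><td class="name">Widget</td><td class="price">9.99</td></tr>'
    '<tr><td class="name">Gadget</td><td class="price">24.50</td></tr>'
    "</table>"
)

rows = ROW.findall(page)
print(rows)  # [('Widget', '9.99'), ('Gadget', '24.50')]
```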




More information about the Python-list mailing list