Parsing HTML tags with 're'

Sun Aug 8 15:13:55 EDT 1999

I'm trying to parse HTML using Python's 're' module.  Generally things
are working fine, but I have one small problem left, and I'd also like to
know if I'm going in the wrong direction entirely.

I have a set of expressions I use for different stages of my parsing
process.  These include:

TagStart   = re.compile( '<[a-zA-Z]|<!--|</[a-zA-Z]' )
TagEnd     = re.compile( '>' )
CommentEnd = re.compile( '-->' )

I use these to find the start and end of tags.  Since I parse line by
line, I can't assume the tag will end in the same string as it started
in, so I don't look for '<*>' or anything similar.  I look for a tag
start, and then I buffer lines until I find a matching tag end.  Once I
have a complete tag, I want to parse out the tag type and the argument
list where appropriate (i.e. not in comments, and end tags don't have
arguments).  To do this parsing, I use these expressions:

Structure = re.compile(r'^<(?P<type>[a-zA-Z_/]\w*)\s*(?P<args>[^>]+)*>')
Arglist   = re.compile('(?P<name>[^=]+)=?(?P<value>.+)?')

The first one separates out the tag type and the optional argument list,
and the second one parses out the individual arguments.
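
For concreteness, here is a rough sketch of the buffering step described
above, followed by the Structure match.  The expressions are the ones
listed in this post (written as raw strings); the helper name iter_tags
is made up for illustration, and the sketch ignores edge cases such as a
'<!--' that is itself split across two lines.

```python
import re

# Expressions from the post, written as raw strings.
TagStart = re.compile(r'<[a-zA-Z]|<!--|</[a-zA-Z]')
TagEnd = re.compile(r'>')
CommentEnd = re.compile(r'-->')
Structure = re.compile(r'^<(?P<type>[a-zA-Z_/]\w*)\s*(?P<args>[^>]+)*>')


def iter_tags(lines):
    """Yield each complete tag or comment as one string.

    Hypothetical helper: buffers input lines until the tag that was
    started has been closed.
    """
    buffer = ''
    for line in lines:
        buffer += line
        while True:
            start = TagStart.search(buffer)
            if start is None:
                buffer = ''  # no tag begins here; drop the plain text
                break
            # Comments end at '-->', every other tag at the first '>'.
            closer = CommentEnd if start.group() == '<!--' else TagEnd
            end = closer.search(buffer, start.end())
            if end is None:
                # Incomplete tag: keep it and wait for more lines.
                buffer = buffer[start.start():]
                break
            yield buffer[start.start():end.end()]
            buffer = buffer[end.end():]


if __name__ == '__main__':
    sample = ['<p>Hello <a href="x"\n',
              '   target="y">link</a> <!-- a\n',
              'comment > here -->\n']
    for tag in iter_tags(sample):
        match = Structure.match(tag)
        if match:
            print(match.group('type'), repr(match.group('args')))
        else:
            print('comment:', repr(tag))
```

Choosing the closing expression from the kind of tag start that was
found keeps a stray '>' inside a comment from ending the comment early.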

The only problem I'm having is with the arglist parser: if an argument
looks like arg="quoted arg with spaces", the spaces inside the quotes
cause the argument to be split into pieces.  What's the best way to fix
this?
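
One possible direction, offered only as a sketch: rather than splitting
the argument string on whitespace first, let a single expression match
one whole argument at a time and treat a quoted value as a unit.  The
names Attr and parse_args below are invented for this example and are
not part of the code above, and the pattern makes no attempt to cover
every corner of real-world HTML.

```python
import re

# Hypothetical replacement for Arglist: one match per argument.
Attr = re.compile(r'''
    (?P<name>[^\s=]+)            # argument name: no whitespace, no equals
    (?:\s*=\s*                   # optional equals sign and value
        (?: "(?P<dq>[^"]*)"      #   double-quoted value, spaces allowed
          | '(?P<sq>[^']*)'      #   single-quoted value, spaces allowed
          | (?P<bare>[^\s"']+)   #   unquoted value, ends at whitespace
        )
    )?
''', re.VERBOSE)


def parse_args(args):
    """Return a list of (name, value) pairs; value is None if absent."""
    parsed = []
    for match in Attr.finditer(args):
        value = None
        for key in ('dq', 'sq', 'bare'):
            if match.group(key) is not None:
                value = match.group(key)
        parsed.append((match.group('name'), value))
    return parsed


if __name__ == '__main__':
    print(parse_args('arg="quoted arg with spaces" width=10 compact'))
```

Because finditer walks the string match by match, the whitespace between
arguments is skipped, while whitespace inside quotes stays part of the
value.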

As I said before, I would also be interested to know if there is a much
better way to do this.  Did I miss a standard HTML parsing Python module?
If this approach is reasonable, feel free to scoop the code for your own
nefarious purposes.
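
On the question of a standard module: current Python 3 releases include
html.parser.HTMLParser in the standard library.  The sketch below
assumes that module; the class name TagCollector and its events list are
made up for illustration.  It feeds the parser line by line, the same
way the approach above reads its input.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical example class: records tags and comments as seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes removed.
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_comment(self, data):
        self.events.append(('comment', data))


if __name__ == '__main__':
    collector = TagCollector()
    # A start tag split across two lines, with a quoted value
    # that contains a space.
    for line in ['<a href="x.html"\n', '   title="two words">link</a>\n']:
        collector.feed(line)
    collector.close()
    for event in collector.events:
        print(event)
```

The parser keeps an unfinished tag in its own buffer between feed()
calls, so the caller does not need to track tag boundaries.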

- Bruce
