[Tutor] Regular Expression question

Michael Janssen Janssen@rz.uni-frankfurt.de
Sun Apr 20 15:42:01 2003


On Sun, 20 Apr 2003, Scott Chapman wrote:

> It appears that the aggregation functions of parenthesis are limited.  This
> does not work:
>
> <html([ \t]*>)|([ \t:].+?>)

This seems to be the point where re gets very sophisticated ;-)

Try to explain in plain words what this re is expected to do, part by
part. This often helps to find logical mistakes. (Hint: "ab|c" doesn't
look for ab or ac but for ab or c, because | splits the whole pattern
unless you wrap the alternatives in parentheses.)
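
A small sketch of that precedence rule, in case it helps. The names
loose, grouped, original and fixed are only for illustration, and
"fixed" is one possible reading of what the pattern was meant to do,
not necessarily what Scott intended:

```python
import re

# "ab|c" means "ab" OR "c"; the | splits the whole pattern.
loose = re.compile(r"ab|c")
# To get "a followed by b or c", the alternatives need their own group.
grouped = re.compile(r"a(?:b|c)")

print(bool(loose.fullmatch("ac")))    # False
print(bool(loose.fullmatch("c")))     # True
print(bool(grouped.fullmatch("ac")))  # True

# The quoted pattern has the same problem: the second alternative
# does not require "<html" at all.
original = re.compile(r"<html([ \t]*>)|([ \t:].+?>)")
print(bool(original.search(" foo>")))  # True, matched without "<html"

# Assumed intent: "<html" followed by either alternative.
fixed = re.compile(r"<html(?:[ \t]*>|[ \t:].+?>)")
print(bool(fixed.search(" foo>")))            # False
print(bool(fixed.fullmatch("<html blah>")))   # True
```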

On the other hand, you might want a different overall design: instead of
one very tricky all-purpose regexp, first break your html into tags, then
get the type of each tag, then its attributes, then the values of the
attributes (or some order like this).

This way, you look for a complete tag and pass it to a function that
analyses the tag further (checks whether it matches certain conditions, or
retrieves the content). If I were to "simply" parse html without a
complete htmlParser, I would do it this way.
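
A rough sketch of that two-step design, under some loud assumptions:
the names TAG_RE, NAME_RE, ATTR_RE, split_tags and analyse_tag are made
up for this example, only double-quoted attribute values are handled,
and comments or a ">" inside an attribute value will confuse it. It
shows the idea, it is not a real HTML parser:

```python
import re

# Step 1: a deliberately simple pattern for "one complete tag".
TAG_RE = re.compile(r"<[^>]+>")
# Step 2 helpers: the tag name and name="value" pairs inside one tag.
NAME_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9:]*)")
ATTR_RE = re.compile(r'([A-Za-z_:][\w:.-]*)\s*=\s*"([^"]*)"')

def split_tags(html):
    """Return every complete tag found in html, in order."""
    return TAG_RE.findall(html)

def analyse_tag(tag):
    """Return (tag name, dict of attributes) for a single tag."""
    m = NAME_RE.match(tag)
    name = m.group(1).lower() if m else None
    attrs = dict(ATTR_RE.findall(tag))
    return name, attrs

sample = '<html lang="en"><body class="main">Hi</body></html>'
for tag in split_tags(sample):
    print(tag, analyse_tag(tag))
```

Each regexp here does one small job, so each one can be explained in
plain words and tested on its own.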

> test(r'<html[([ \t]*)([ \t:].+?)]>')
>
> <html blah>
> Traceback (most recent call last):
...
> sre_constants.error: unbalanced parenthesis

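[]-brackets build a character set. In '[([ \t]*)([ \t:].+?)]' they are
simply misused: the set ends at the first ']', so the ')' that follows
has no matching '(' and re reports an unbalanced parenthesis.
'(([ \t]*)([ \t:].+?))' works, but the outer parentheses are a no-op for
aggregation (though not for retrieving values).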
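
To illustrate the difference between a character set and a group, here
is a short sketch. The variable names charset, group, m and compiled_ok
are just for this example:

```python
import re

# Inside [...], "(" and ")" are ordinary characters to match.
charset = re.compile(r"[(ab)]")
print(charset.findall("a(b)c"))  # ['a', '(', 'b', ')']

# The pattern from the traceback: the set is only "[([ \t]", so the
# first ")" after it is unbalanced and compiling fails.
try:
    re.compile(r"<html[([ \t]*)([ \t:].+?)]>")
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)  # False

# With real parentheses it compiles, and each group captures a value.
group = re.compile(r"<html(([ \t]*)([ \t:].+?))>")
m = group.match("<html blah>")
print(m.groups())  # (' blah', '', ' blah')
```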

Michael