[Tutor] Regular Expression question

Scott Chapman scott_list@mischko.com
Mon Apr 21 00:15:08 2003


On Sunday 20 April 2003 12:41, Michael Janssen wrote:
> On Sun, 20 Apr 2003, Scott Chapman wrote:
> > It appears that the aggregation functions of parentheses are limited.
> > This does not work:
> >
> > <html([ \t]*>)|([ \t:].+?>)
>
> Seems to be a point, where re gets very sophisticated ;-)

Seems to me to be a point where re fails to get very sophisticated!

> You should try to explain in plain words what this re is expected to do
> (and each part of it). This often helps to find logical mistakes
> (Pointer: "ab|c" doesn't look for ab or ac but for ab or c).

This is actually supposed to be quite simple:
<html 
followed by either:
[ \t]*>
or
[ \t:].+?>

I wanted to see if parentheses would work for grouping in this context.  They 
don't work here.
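
For reference, a minimal sketch of how the plain-words description above can
be written, assuming the goal really is "<html" followed by either
alternative. The | operator binds more loosely than anything else (the same
rule as the "ab|c" pointer), so the pattern as written reads as
'<html[ \t]*>' OR '[ \t:].+?>' on its own. One group around the whole
alternation keeps "<html" as a shared prefix. The names original and grouped
and the sample strings are made up for illustration:

```python
import re

# Pattern as originally written: "|" splits the WHOLE pattern in two,
# so the right-hand branch no longer requires the "<html" prefix.
original = re.compile(r'<html([ \t]*>)|([ \t:].+?>)')

# One group around the alternation keeps "<html" common to both branches.
# (?:...) groups without capturing; plain (...) would group just as well.
grouped = re.compile(r'<html(?:[ \t]*>|[ \t:].+?>)')

samples = ['<html>', '<html  >', '<html lang="en">', '<htmlx>', ' stray>']

for text in samples:
    print(repr(text),
          'original:', bool(original.match(text)),
          'grouped:', bool(grouped.match(text)))
```

With these samples, the original pattern should reject '<html lang="en">'
but accept ' stray>', while the grouped pattern does the reverse.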

> On the other hand, you should possibly use a different overall design:
> instead of one very tricky all-purpose regexp you should break your html
> into tags in a first step, then get the type of each tag, then the
> attributes, then the values of the attributes (or some order like this).

I'm not seriously considering using RE to parse the HTML.  I'm already looking 
into HTMLParser.  I don't like it very well but I can make it work.  This 
stuff is to help me learn re at this point.

> This way, you look for a complete tag and give it to a function that
> analyses the tag further (checks whether it matches certain conditions or
> retrieves the content). If I were to try to "simply" parse html without a
> complete htmlParser, I would do it this way.
>
> > test(r'<html[([ \t]*)([ \t:].+?)]>')
> >
> > <html blah>
> > Traceback (most recent call last):
>
> ...
>
> > sre_constants.error: unbalanced parenthesis
>
> []-brackets build a character set. With '[([ \t]*)([ \t:].+?)]' they are
> simply misused. '(([ \t]*)([ \t:].+?))' works, but the outer parentheses
> are a no-op for aggregation (not for retrieving values).
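
A small sketch of the character-set point, assuming a current Python re
module. Inside [...] the parentheses are ordinary characters, and the set
ends at the first "]", which is why the pattern from the traceback is left
with a stray ")". The name choice and the sample strings are hypothetical:

```python
import re

# Inside [...] every "(" and ")" is a literal character, not a group.
print(re.findall(r'[(ab)]', '(a)b'))

# The pattern from the traceback: the set ends at the first "]",
# i.e. it is just [([ \t], which leaves the following ")" unmatched.
try:
    re.compile(r'<html[([ \t]*)([ \t:].+?)]>')
    compiled = True
except re.error as exc:
    compiled = False
    print('re.error:', exc)

# Choosing between multi-character alternatives takes a group with "|".
choice = re.compile(r'(?:cat|dog)s?')
print(choice.findall('cats and a dog'))
```

So a set picks one character out of several, while a group with | picks one
sub-pattern out of several.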

My point in trying this was again to see how useful parentheses are for 
grouping.  I understand that []-brackets are used to indicate "one of the 
contents of the brackets".

I'd like to see parentheses work for grouping on either side of a | (boolean 
OR indicator) and inside of square brackets to allow selection of one among a 
group of items in parentheses; again a boolean OR situation. No go!  Oh well.  
Maybe the language gurus can add it at some point.  I'm not savvy enough 
about re to know if this is a good idea or not.  It would make it possible to 
do some very nice stuff with re that currently isn't this simple.  I think it 
would make re a lot more powerful.
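
As a hedged aside, a sketch of what groups on either side of a | look like
in the re module today, assuming the aim is to pick one of several
sub-patterns and find out which one matched. The names token and classify
and the 'number'/'word' labels are made up for illustration:

```python
import re

# Capturing groups on both sides of "|"; only the branch that
# matched gets a value, the other group is None.
token = re.compile(r'(\d+)|([a-z]+)')

def classify(text):
    """Return ('number' or 'word', matched_text), or None if nothing matches."""
    m = token.match(text)
    if m is None:
        return None
    # lastindex is the index of the last group that took part in the match.
    kind = {1: 'number', 2: 'word'}[m.lastindex]
    return kind, m.group(m.lastindex)

print(classify('42 apples'))
print(classify('apples 42'))
print(classify('!?'))
```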

Scott