[Tutor] Regular Expression question
Mon Apr 21 00:15:08 2003
On Sunday 20 April 2003 12:41, Michael Janssen wrote:
> On Sun, 20 Apr 2003, Scott Chapman wrote:
> > It appears that the aggregation functions of parentheses are limited.
> > This does not work:
> > <html([ \t]*>)|([ \t:].+?>)
> Seems to be a point, where re gets very sophisticated ;-)
Seems to me to be a point where re fails to get very sophisticated!
> You should try to explain in plain words what this re is expected to do
> (and each part of it). This often helps to find logical mistakes
> (Pointer: "ab|c" doesn't look for ab or ac but for ab or c).
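This is actually supposed to be quite simple: the literal text "<html",
followed by either:
  zero or more spaces or tabs and then ">", or
  a space, tab, or colon, then one or more characters up to the next ">".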
I wanted to see if parentheses would work for grouping in this context. They
don't work here.
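
For comparison, here is a minimal sketch of how the two forms behave,
assuming the intent is "<html" followed by either alternative. The names
ungrouped, grouped, and sample are only illustrative. In the pattern above
the | sits at the top level, so it splits the whole expression in two;
wrapping both alternatives in a single group keeps "<html" common to both.

```python
import re

# Top-level "|" splits the WHOLE pattern, so this reads:
#   "<html[ \t]*>"   OR   "[ \t:].+?>"
ungrouped = re.compile(r'<html([ \t]*>)|([ \t:].+?>)')

# One (non-capturing) group around both alternatives keeps "<html"
# as a shared prefix of either branch.
grouped = re.compile(r'<html(?:[ \t]*>|[ \t:].+?>)')

sample = '<html blah>'
ungrouped_hit = ungrouped.search(sample)
grouped_hit = grouped.search(sample)

print(repr(ungrouped_hit.group(0)))  # ' blah>'  (second branch, without "<html")
print(repr(grouped_hit.group(0)))    # '<html blah>'
```

With the grouped form, '<html>' and '<html:x>' also match, while
'<htmlfoo>' does not.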
> On the other hand, you should possibly use a different overall design:
> instead of one very tricky all-purpose regexp, you should first break your
> html into tags, then get the type of each tag, then the attributes, then
> the values of the attributes (or any order like this).
I'm not seriously considering using RE to parse the HTML. I'm already looking
into HTMLParser. I don't like it very well but I can make it work. This
stuff is to help me learn re at this point.
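
A minimal sketch of the HTMLParser route, assuming a current Python 3
where the class lives in html.parser (in 2003 it was the HTMLParser
module). TagCollector and its tags list are hypothetical names chosen
for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attr: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('<html lang="en"><body class="main">hi</body></html>')
collector.close()
print(collector.tags)
```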
> This way, you look for a complete tag and give it to a function that
> analyses the tag further (checks whether it matches certain conditions or
> retrieves the content). If I were to try to "simply" parse html without a
> complete htmlParser, I would do it this way.
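
The staged design described above could be sketched roughly as follows.
This is only an illustration under simplifying assumptions: opening tags
only, double-quoted attribute values only, and no handling of comments or
scripts. TAG_RE, ATTR_RE, split_tags, parse_attrs, and analyse are
hypothetical names.

```python
import re

# Step 1 pattern: an opening tag, capturing its name and raw attribute text.
TAG_RE = re.compile(r'<([A-Za-z][A-Za-z0-9]*)([^<>]*)>')
# Step 2 pattern: name="value" pairs (double-quoted values only).
ATTR_RE = re.compile(r'([A-Za-z_:][-\w:.]*)\s*=\s*"([^"]*)"')


def split_tags(html):
    """Step 1: yield (tag_name, raw_attribute_text) for each opening tag."""
    for m in TAG_RE.finditer(html):
        yield m.group(1).lower(), m.group(2)


def parse_attrs(raw):
    """Step 2: turn the raw attribute text of one tag into a dict."""
    return {name.lower(): value for name, value in ATTR_RE.findall(raw)}


def analyse(html):
    """Run both steps and return a list of (tag_name, attributes)."""
    return [(name, parse_attrs(raw)) for name, raw in split_tags(html)]


print(analyse('<html lang="en"><body class="main" id="top">hi</body></html>'))
```

Each pattern stays small enough to explain in plain words, which is the
point of splitting the work into steps.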
> > test(r'<html[([ \t]*)([ \t:].+?)]>')
> > <html blah>
> > Traceback (most recent call last):
> > sre_constants.error: unbalanced parenthesis
> []-brackets build a character set. With '[([ \t]*)([ \t:].+?)]' they are
> simply misused. '(([ \t]*)([ \t:].+?))' works, but is a no-op for
> aggregation (not for retrieving values).
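
A short sketch of the difference, assuming a current Python 3 re module
(the exact wording of the error message may differ between versions).
The names bracket_error, nested, and nested_groups are illustrative.

```python
import re

# Inside [...] a "(" is an ordinary character, so "[([ \t]" is a set holding
# "(", "[", space, and tab. The ")" that follows has no opening partner.
try:
    re.compile(r'<html[([ \t]*)([ \t:].+?)]>')
    bracket_error = None
except re.error as exc:
    bracket_error = str(exc)

print(bracket_error)

# With parentheses instead of brackets the pattern compiles. The outer group
# adds nothing to the matching logic but does capture a value.
nested = re.compile(r'<html(([ \t]*)([ \t:].+?))>')
nested_groups = nested.match('<html blah>').groups()
print(nested_groups)  # (' blah', '', ' blah')
```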
My point in trying this was again to see how useful the parentheses are for
grouping. I understand that []-brackets are used to indicate "one of the
contents of the brackets".
I'd like to see parentheses work for grouping on either side of a | (boolean
OR indicator) and inside of square brackets to allow selection of one among a
group of items in parentheses; again a boolean OR situation. No go! Oh well.
Maybe the language gurus can add it in at some point. I'm not savvy enough
about re to know if this is a good idea or not. It would make it possible to
do some very nice stuff with re that's currently not this simple. I think it
would make re a lot more powerful.