[Tutor] Regular Expression question

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon Apr 21 01:40:02 2003


> > > It appears that the aggregation functions of parenthesis are limited.
> > > This does not work:
> > >
> > > <html([ \t]*>)|([ \t:].+?>)

Hi Scott,

Let me see if I can decipher this one... *grin* To make this easier on my
eyes, I will spread out the regular expression a bit so I can see where
the groups form.  I'll try to break up:

    <html([ \t]*>)|([ \t:].+?>)

using some informal indentation.

    <html
    (
        [ \t]*>
    )
    |
    (
        [ \t:]
        .+?>
    )

... This actually looks syntactically ok to me!  (One note: we can
collapse the '[ \t]' parts by using the metacharacter for whitespace,
'\s', though '\s' is a little broader: it also matches newlines and
other whitespace characters.)  Let me try it really fast to see if it
raises the error you ran into:

###
>>> regex = re.compile(r'''
... <html
... (
...     \s*>
... )
... |
... (
...     [\s:]
...     .+?
... )''', re.VERBOSE)
>>> regex.match('<html  >')
<_sre.SRE_Match object at 0x15e740>
>>> regex.match('<html  :')          ## ?! Why not?
>>> regex.match('<html :>')          ## ??
>>> regex.match('<html :')           ## wait a sec...
>>> regex.match('::::::')            ## that's better!
<_sre.SRE_Match object at 0x111da0>
###

So I'm not getting an unbalanced parentheses error, though I am running
into some unexpected behavior with what the pattern recognizes.  Ah, I see
now.  We'll explain that mysterious behavior above in just a moment, so
let's dive into the rest of your question.


> This is actually supposed to be quite simple:
> <html
> followed by either:
> [ \t]*>
> or
> [ \t:].+?>


Scott, the ORing operator '|' has really low precedence (lower than plain
concatenation), so if we say something like:

    <html([ \t]*>)|([ \t:].+?>)

Python's regular expression engine is actually breaking it down into

    <html([ \t]*>)
        |
    ([ \t:].+?>)


That is, instead of telling the regex engine this:

> <html
> followed by either:
> [ \t]*>
> or
> [ \t:].+?>


we've actually told it to recognize this instead:

> <html followed by [ \t]*>
> or
> [ \t:].+?>
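
Here is a small sketch of that reading, assuming a current Python 3
interpreter.  The test strings are made up for illustration; the pattern
is the one you posted.

```python
import re

# The pattern exactly as posted. The '|' splits it at the top level.
posted = re.compile(r'<html([ \t]*>)|([ \t:].+?>)')

# Made-up test strings, paired with whether match() succeeds on them.
results = {s: posted.match(s) is not None
           for s in ['<html  >', ': anything>', '<html :>']}

for s, ok in results.items():
    print(repr(s), ok)
```

The first string matches through the left branch.  The second matches
through the right branch even though it has no '<html' in it at all.  The
third fits neither branch from the start of the string, so match() gives
back None.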


To fix this, we may want to use parentheses (either capturing '(...)' or
non-capturing '(?:...)' should work) to get that 'either' behavior that
you want:

    <html
    (
        (
            [ \t]*>
        )
        |
        (
            [ \t:]
            .+?>
        )
    )
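
As a rough sketch of that fix in Python 3, here is one way to write it
with a non-capturing group around the two alternatives.  The name 'fixed'
and the sample strings are my own, just for illustration:

```python
import re

# '<html' is now outside the group, so it is required by both branches.
fixed = re.compile(r"""
    <html
    (?:
        [ \t]* >         # only optional blanks, then the closing '>'
      |
        [ \t:] .+? >     # a blank, tab or colon, more text, then '>'
    )
""", re.VERBOSE)

for s in ['<html>', '<html  >', '<html :>', '<htmlfoo>', '::::::']:
    print(repr(s), fixed.match(s) is not None)
```

With the extra group, the first three strings match and the last two do
not, since a string must start with '<html' to get anywhere.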


Please feel free to ask more questions on this; regular expressions have a
few zaps that we need to be wary of --- the way that the OR operator works
is one of those zaps.

As you play with regular expressions, you may want to use VERBOSE mode to
define them: verbose mode lets us use indentation to make the regex more
readable to humans.
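
One thing to watch for in verbose mode: because whitespace in the pattern
is ignored, a literal space has to go inside a character class or be
escaped with a backslash.  A tiny sketch, with toy patterns that are not
from your question:

```python
import re

# In VERBOSE mode the bare space is dropped, so this is really 'ab'.
squashed = re.compile(r'a b', re.VERBOSE)

# A space inside a character class, or an escaped space, stays literal.
in_class = re.compile(r'a[ ]b', re.VERBOSE)
escaped = re.compile(r'a\ b', re.VERBOSE)

print(squashed.match('ab') is not None)
print(squashed.match('a b') is not None)
print(in_class.match('a b') is not None)
print(escaped.match('a b') is not None)
```

This is why the '[ \t]' pieces of your pattern survive being spread out
over several lines: the space sits inside a character class.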


I hope this helps!