[Tutor] Regular Expression question
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Mon Apr 21 01:40:02 2003
> > > It appears that the aggregation functions of parenthesis are limited.
> > > This does not work:
> > >
> > > <html([ \t]*>)|([ \t:].+?>)
Hi Scott,
Let me see if I can decipher this one... *grin* To make this easier on my
eyes, I will spread out the regular expression a bit so I can see where
the groups form. I'll try to break up:
<html([ \t]*>)|([ \t:].+?>)
using some informal indentation.
<html
(
    [ \t]*>
)
|
(
    [ \t:]
    .+?>
)
... This actually looks syntactically ok to me! (One note: we can
collapse the '[ \t]' parts by using the metacharacter for whitespace,
'\s'.) Let me try it really fast to see if this does raise an unbalanced
parentheses error:
###
>>> import re
>>> regex = re.compile(r'''
...     <html
...     (
...         \s*>
...     )
...     |
...     (
...         [\s:]
...         .+?
...     )''', re.VERBOSE)
>>> regex.match('<html >')
<_sre.SRE_Match object at 0x15e740>
>>> regex.match('<html :') ## ?! Why not?
>>> regex.match('<html :>') ## ??
>>> regex.match('<html :') ## wait a sec...
>>> regex.match('::::::') ## that's better!
<_sre.SRE_Match object at 0x111da0>
###
So I'm not getting an unbalanced parentheses error, though I am running
into some unexpected behavior with what the pattern recognizes. Ah, I see
now. We'll explain that mysterious behavior above in just a moment, so
let's dive into the rest of your question.
> This is actually supposed to be quite simple:
> <html
> followed by either:
> [ \t]*>
> or
> [ \t:].+?>
Scott, the ORing operator '|' has really low precedence, so if we say
something like:
<html([ \t]*>)|([ \t:].+?>)
Python's regular expression engine is actually breaking it down into
<html([ \t]*>)
|
([ \t:].+?>)
That is, instead of telling the regex engine this:
> <html
> followed by either:
> [ \t]*>
> or
> [ \t:].+?>
we've actually told it to recognize this instead:
> <html followed by [ \t]*>
> or
> [ \t:].+?>
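Here's a small sketch of that breakdown in action, using Scott's pattern
exactly as written. The test strings are made up for illustration; the
point is only that the right-hand branch never asks for '<html' at all.

```python
import re

# Scott's original pattern, unchanged.
original = re.compile(r'<html([ \t]*>)|([ \t:].+?>)')

# Left branch: '<html', optional blanks, then '>'.
left = original.match('<html >')
print(left.group(0))

# Neither branch fits here: the left branch wants '>' right after the
# blanks, and the right branch has to start with a blank, tab, or ':'.
print(original.match('<html :>'))

# The right branch stands alone, so it matches text with no '<html' in it.
right = original.match(': anything>')
print(right.group(1), right.group(2))
```

Only one of the two groups gets filled in on any given match, which is
another hint that the '|' is splitting the whole pattern in two.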
To fix this, we may want to use an extra pair of parentheses around the
alternatives (either grouping or non-grouping parentheses should work) to
get that 'either' behavior that you want:
<html
(
    (
        [ \t]*>
    )
    |
    (
        [ \t:]
        .+?>
    )
)
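As a rough sketch, here is the same fix squeezed onto one line. I'm using
a non-capturing '(?:...)' group for the outer parentheses and dropping the
inner ones; that is just one of the choices mentioned above, and the
sample strings are my own, so treat it as an illustration rather than the
final word.

```python
import re

# '<html' is now required in front of BOTH alternatives, because the
# '|' is confined to the (?:...) group that follows it.
fixed = re.compile(r'<html(?:[ \t]*>|[ \t:].+?>)')

for text in ['<html>', '<html >', '<html :>', '<html lang="en">',
             ': anything>', '<htmlx>']:
    m = fixed.match(text)
    print(repr(text), '->', m.group(0) if m else None)
```

The first four strings should match in full, and the last two should not
match at all, since one lacks the '<html' prefix and the other has
something other than a blank, tab, ':' or '>' right after it.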
Please feel free to ask more questions on this; regular expressions have a
few zaps that we need to be wary of --- the way that the OR operator works
is one of those zaps.
As you play with regular expressions, you may want to use VERBOSE mode to
define them: verbose mode lets us use indentation to make the regex more
readable to humans.
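For example, here is one way the grouped pattern might be laid out in
verbose mode. The layout and the comments are my own choices, not the only
reasonable ones. Two things to keep in mind: whitespace outside a
character class is ignored in this mode, and '#' starts a comment that
runs to the end of the line.

```python
import re

fixed_verbose = re.compile(r"""
    <html
    (?:                 # either ...
        [ \t]* >        #   optional blanks, then the closing bracket
      |                 # ... or ...
        [ \t:] .+? >    #   a blank, tab or colon, then text up to the next bracket
    )
""", re.VERBOSE)

print(fixed_verbose.match('<html :>'))
print(fixed_verbose.match(': anything>'))
```

The blank inside '[ \t]' still counts, because it sits inside a character
class; the spaces used for indentation do not.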
I hope this helps!