[Tutor] Regular Expression question [character classes will quote their contents!]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon Apr 21 17:04:01 2003


> > > test(r'<html[([ \t]*)([ \t:].+?)]>')


Hi Scott,


If you're trying to use brackets as grouping operators, that's the
problem.  The brackets '[' and ']' are meant to define character classes,
so the brackets here:

         test(r'<html[([ \t]*)([ \t:].+?)]>')
                     ^                   ^

are very ambiguous to me.



> My point in trying this was again to see how useful the parenthesis are
> for grouping.  I understand that []-brackets are used to indicate "one
> of the contents of the brackets".


Here are some examples of character classes:

    [aeiou]
    [ \t\n]
    [()]


And that last example is supposed to illustrate a particular feature of
character classes: they escape the normal meaning of the special regular
expression characters!

Character classes are shorthand, and use their own rules for what they
treat as special.  Let's see what the last case will recognize:

###
>>> import re
>>> regex = re.compile("[()]+")
>>> regex.match("(()()()(((())((())())()")
<_sre.SRE_Match object at 0x81680d8>
>>> regex.match("foo")
>>>
###


And now we've just written a regular expression to recognize literal
parentheses.  *grin*



This is probably one of the main problems that you've been running into.
Character classes are "shorthand" because we don't strictly need them: we
can just as easily write something like:

    (a|e|i|o|u)

instead of

    [aeiou]

Using the character class is easier to type, but you have to be aware that
its shortcut nature turns on a few more rules that will conflict with what
you already know about regular expressions.  For example, hyphens in a
character class will specify a "range" of characters:


    [a-z]     # (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)

And things that used to be special (like '.', '+', or '(') are treated as
literal characters in a character class.  In essence, character classes
are a shortcut that saves us some typing.
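
As a rough check of the three points above, here is a small sketch.  The
variable names are only for illustration, and re.fullmatch() assumes a
newer Python than the one used in the sessions in this message:

```python
import re

# Alternation and a character class accept the same single characters.
alternation = re.compile(r"(a|e|i|o|u)")
char_class = re.compile(r"[aeiou]")
same = all(
    bool(alternation.fullmatch(ch)) == bool(char_class.fullmatch(ch))
    for ch in "abcdefghijklmnopqrstuvwxyz"
)

# A hyphen between two characters defines a range.
lowercase = re.compile(r"[a-z]")
range_ok = bool(lowercase.fullmatch("q")) and not lowercase.fullmatch("Q")

# Inside a class, '.', '+', and '(' are ordinary characters.
literals = re.compile(r"[.+(]")
literal_hits = [ch for ch in ".+(x" if literals.fullmatch(ch)]

print(same, range_ok, literal_hits)
```

The last list should contain only the three punctuation characters: 'x'
is not in the class, and '.' does not act as a wildcard there.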




> I'd like to see parenthesis work for grouping on either side of a |
> (boolean OR indicator)

Sounds ok so far.



> and inside of square brackets to allow selection of one among a group of
> items in parenthesis;

And that's the part that probably won't work the way you expect, precisely
because the parentheses themselves will be treated as one of the
characters in the character class.


To get the effect that you want, we can do something like this:

    (
        [aeiou]
        |
        [0123456789]
    )

"Either a vowel, or a digit."
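
The pattern above is spread over several lines for readability.  One way
to compile it in that layout is the re.VERBOSE flag, which makes re ignore
whitespace in the pattern (outside of character classes).  A minimal
sketch; the name vowel_or_digit is made up here:

```python
import re

# re.VERBOSE lets the pattern keep its indented, multi-line layout.
vowel_or_digit = re.compile(r"""
    (
        [aeiou]
        |
        [0123456789]
    )
""", re.VERBOSE)

# Try a vowel, a digit, a consonant, and a literal parenthesis.
results = {s: bool(vowel_or_digit.match(s)) for s in ["a", "7", "f", "("]}
print(results)
```

Here the parentheses sit outside the brackets, so they group the two
alternatives instead of being matched as characters.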


Note that trying something like:

###
regex = re.compile("[(aeiou)|(0123456789)]")
###


does NOT have the same meaning, since the parentheses themselves (as well
as the '|' symbol) end up being part of the character class.

###
>>> regex.match('|')
<_sre.SRE_Match object at 0x8125fd0>
>>> regex.match('a')
<_sre.SRE_Match object at 0x8125a58>
>>> regex.match('f')
>>>
###


If we wanted to choose either a vowel or a digit, it's probably easiest
just to say:

     [aeiou0-9]

in a single character class.
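
To see the difference between the two side by side, here is a short
sketch.  The names combined and mistaken, and the sample strings, are
just for illustration:

```python
import re

# One class holding the vowels plus the digit range.
combined = re.compile(r"[aeiou0-9]")
# The earlier attempt: '(', ')', and '|' become members of the class.
mistaken = re.compile(r"[(aeiou)|(0123456789)]")

samples = ["a", "7", "f", "|", "("]
combined_hits = [s for s in samples if combined.match(s)]
mistaken_hits = [s for s in samples if mistaken.match(s)]

print(combined_hits)
print(mistaken_hits)
```

Only the second pattern should accept '|' and '('.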




> It would make it possible to do some very nice stuff with re that's
> currently not this simple.  I think it would make re a lot more
> powerful.

Regular expressions are powerful, tremendously so.  The problem is that
it is not necessarily easy for humans to write them properly the first
time.  *grin*

Play around with them a little more, and you should get the hang of them.



Please feel free to ask more questions here on Tutor.  Good luck!