[Tutor] newbie re question

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon Jun 30 14:07:02 2003


On Mon, 30 Jun 2003 tpc@csua.berkeley.edu wrote:

> hi Danny, I had a question about your quick intro to Python lesson sheet
> you gave out back in March 2001.

Hi tpc,



Some things will never die.  *grin*



> The last page talks about formulating a regular expression to handle
> URLs, and you have the following:
>
> myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')



Ok.  Let's split that up using verbose notation:


###
import re

myre = re.compile(r'''http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###
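If you want to convince yourself that the verbose form is the same regex as the one-liner, here is a minimal sketch that compares the two on the hotmail URL from your message.  The names "compact", "verbose" and "sample" are just made up for this illustration, and I've used single "#" comments to keep it short.

```python
import re

# The one-line pattern, exactly as quoted from the lesson sheet.
compact = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

# The same pattern in verbose notation; with re.VERBOSE, whitespace and
# "#" comments outside a character class are ignored.
verbose = re.compile(r'''http://      # protocol
                         [\w\.-/]+    # one or more URL characters
                         \.?          # optional period
                         (?![\w.-/])  # negative lookahead
                      ''', re.VERBOSE)

sample = 'http://www.hotmail.com'
print(compact.search(sample).group() == verbose.search(sample).group())
```

Both patterns should pick out the same text, so the comparison prints True.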


The page:

    http://www.python.org/doc/lib/re-syntax.html

has more details about some of the regular expression syntax.  AMK has
written a nice regex HOWTO here:

    http://www.amk.ca/python/howto/regex/regex.html

which you might find useful.




> I understand \w stands for any word character, and \. means escaped
> period, and ? means zero or one instances of a character or set.  I did
> a myre.search('http://www.hotmail.com') which yielded a result, but I am
> confused as to why
>
> myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>
> would work, since there is a '=' and you don't provide for one in the
> regular expression.


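Very true.  The regex never gets as far as the '=', though.  The character
class stops at the first character it doesn't accept, and the negative
lookahead then only checks that the next character isn't another URL
character.  The idea was to pick a URL out of running text like:

    "This is an url: http://python.org.  Isn't that neat?"


The negative lookahead should succeed right here: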

    "This is an url: http://python.org.  Isn't that neat?"
                                       ^

In your URL above, the character class should run out at the question
mark, so the negative lookahead is tested there, before the regex ever
sees '='.  That regex was a sloppy example; I should have been more
careful with it, but I was in a hurry when I wrote that intro...  *grin*
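To see where the match really ends in that sentence, here is a small sketch using the one-line pattern from your message.  The variable name "text" is just for illustration.

```python
import re

# The pattern exactly as quoted from the lesson sheet.
myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

text = "This is an url: http://python.org.  Isn't that neat?"
match = myre.search(text)

# The text that was matched.
print(repr(match.group()))

# The position just past the match, and the character sitting there;
# this is the character the negative lookahead inspected.
print(match.end(), repr(text[match.end()]))
```

This prints 'http://python.org.' and then 34 ' '.  Note that the sentence-ending period ends up inside the match, because "." is one of the characters the class accepts; the lookahead is tested at the space that follows.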



If you're in the Berkeley area, by the way, you might want to see if
Ka-Ping Yee is planning another CS 198 class in the future:

    http://zesty.ca/bc/info.html





Anyway, we can experiment with this more easily by introducing a group
into the regular expression:

###
myre = re.compile(r'''
                    (                  ## group 1

                      http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                    )                  ## end group


                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###



Let's check it now:

###
>>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>>> match.group(1)
'http://www.sfgate.com/cgi'
###



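Oiii!  The regular expression is broken: the match stops at the hyphen in
"cgi-bin".  What has happened is that I've incorrectly placed the hyphen
in the character class.  That is, instead of


    [\w\.-/]+


I should have written:

    [\w./-]+


to keep the regex engine from treating the hyphen as a range operator (as
in "[a-z]" or "[0-9]").  In the broken version, ".-/" is read as the range
from "." to "/", which contains only those two characters, so a literal
hyphen never matches.  The same fix applies to the class inside the
negative lookahead.  You can then add the other characters you need, such
as "?" and "=", to the character class, and then it should correctly match
the sfgate URL.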
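
Here is a minimal sketch that puts the three versions side by side.  The names "broken", "fixed" and "extended" are made up for this example, and the choice of "?" and "=" as the extra characters is an assumption based on this one URL, not a complete definition of what a URL may contain.

```python
import re

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Original class: ".-/" is read as the range from "." to "/",
# so a literal "-" is not part of the class.
broken = re.compile(r'(http://[\w\.-/]+\.?)(?![\w.-/])')

# Hyphen moved to the end of the class, where it is a literal hyphen.
fixed = re.compile(r'(http://[\w./-]+\.?)(?![\w./-])')

# Assumption: "?" and "=" are the extra characters this URL needs.
extended = re.compile(r'(http://[\w./?=-]+\.?)(?![\w./?=-])')

print(broken.search(url).group(1))    # stops before "-bin"
print(fixed.search(url).group(1))     # stops before "?"
print(extended.search(url).group(1))  # the whole URL
```

The first line of output is 'http://www.sfgate.com/cgi', as in the session above; the second gets as far as 'article.cgi'; the third is the complete URL.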



I hope this helps!