[Tutor] newbie re question

Mon Jun 30 14:07:02 2003

On Mon, 30 Jun 2003 tpc@csua.berkeley.edu wrote:

> hi Danny, I had a question about your quick intro to Python lesson sheet
> you gave out back in March 2001.

Hi tpc,

Some things will never die.  *grin*

> The last page talks about formulating a regular expression to handle
> URLs, and you have the following:
>
> myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

Ok.  Let's split that up using verbose notation:

###
myre = re.compile(r'''http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###
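
As a quick sanity check, here is a small sketch that compiles both
spellings and confirms they find the same text.  The names `compact`,
`verbose` and `sample` are placeholders made up for this illustration;
they are not from the lesson sheet.

```python
import re

# The one-line pattern exactly as quoted above.
compact = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

# The same pattern in verbose form.  With re.VERBOSE, whitespace and
# unescaped '#' comments outside a character class are ignored.
verbose = re.compile(r'''http://      # protocol
                         [\w\.-/]+    # one or more URL-ish characters
                         \.?          # optional period
                         (?![\w.-/])  # not followed by another such character
                      ''', re.VERBOSE)

sample = 'http://www.hotmail.com'
print(compact.search(sample).group(0))
print(verbose.search(sample).group(0))
```

Both lines should print the same thing, so the verbose layout only changes
how the pattern reads, not what it matches.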

The page:

    http://www.python.org/doc/lib/re-syntax.html

has more details about some of the regular expression syntax.  AMK has
written a nice regex HOWTO here:

    http://www.amk.ca/python/howto/regex/regex.html

which you might find useful.

> I understand \w stands for any word character, and \. means escaped
> period, and ? means zero or one instances of a character or set.  I did
> a myre.search('http://www.hotmail.com') which yielded a result, but I am
> confused as to why
>
> myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>
> would work, since there is a '=' and you don't provide for one in the
> regular expression.

Very true.  The search doesn't have to consume the whole string, though:
it only needs a run of URL characters that the negative lookahead confirms
is not followed by another one.  The idea was to pick a URL out of text
like:

    "This is an url: http://python.org.  Isn't that neat?"

The negative lookahead should succeed right here:

    "This is an url: http://python.org.  Isn't that neat?"
                                       ^

In your url above, the match should actually stop at the question mark,
before it ever gets to the '='.  That regex was a sloppy example; I should
have been more careful with it, but I was in a hurry when I wrote that
intro...  *grin*
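
To see where the lookahead lets the match end, here is a minimal sketch
that runs the original pattern against the example sentence.  The variable
name `text` is just something I picked for the illustration.

```python
import re

# Original pattern from the lesson sheet, unchanged.
myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

text = "This is an url: http://python.org.  Isn't that neat?"
match = myre.search(text)

# The text the regex actually matched.
print(repr(match.group(0)))

# The character sitting right after the match; this is what the
# negative lookahead inspected and found to be outside the class.
print(repr(text[match.end()]))
```

Notice that the trailing period ends up inside the match, because '.' is
itself one of the characters in the class.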

If you're in the Berkeley area, by the way, you might want to see if
Ka-Ping Yee is planning another CS 198 class in the future:

    http://zesty.ca/bc/info.html

Anyway, we can experiment with this more easily by introducing a group
into the regular expression:

###
myre = re.compile(r'''
                    (                  ## group 1

                      http://            ## protocol
                      [\w\.-/]+          ## followed by a bunch of "word"
                                         ## characters

                      \.?                ## followed by an optional
                                         ## period.

                    )                  ## end group

                      (?!                ## Topped with a negative
                                         ## lookahead for
                            [\w.-/]      ## "word" character.

                      )''', re.VERBOSE)
###

Let's check it now:

###
>>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>>> match.group(1)
'http://www.sfgate.com/cgi'
###

Oiii!  The regular expression is broken.  What has happened is that I've
written the hyphen incorrectly in the character class.  That is, instead
of

    [\w\.-/]+

I should have done:

    [\w./-]+

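to keep the regex engine from reading the hyphen as a range operator (as
in "[a-z]" or "[0-9]").  In the original, "\.-/" means every character
from '.' through '/', which leaves out the hyphen itself, so the match
stopped dead at "cgi-bin".  You can then add the other characters you need
(like '?' and '=') to the character class, and then it should correctly
match the sfgate url.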
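
Here is a rough sketch of both points: what the range does, and what the
repaired pattern picks up.  I'm assuming the same fix belongs in the
lookahead's character class too, and the `extended` pattern with '?' and
'=' is only one hypothetical way to widen the class; `fixed`, `extended`
and `url` are names invented for this example.

```python
import re

# Inside brackets, '.-/' is a range from '.' to '/', which holds just
# those two characters, so a literal hyphen is not matched.
print(re.findall(r'[.-/]', '.-/'))

# With the hyphen placed last, it is taken literally.
print(re.findall(r'[./-]', '.-/'))

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Assumption: apply the hyphen fix to both character classes.
fixed = re.compile(r'(http://[\w./-]+\.?)(?![\w./-])')
print(fixed.search(url).group(1))

# Hypothetical extension: also allow '?' and '=' so query strings match.
extended = re.compile(r'(http://[\w./?=-]+\.?)(?![\w./?=-])')
print(extended.search(url).group(1))
```

The `fixed` pattern now gets past "cgi-bin" and stops at the question
mark, while `extended` runs through to the end of the url.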

I hope this helps!