[Tutor] newbie re question
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Mon Jun 30 14:07:02 2003
On Mon, 30 Jun 2003 tpc@csua.berkeley.edu wrote:
> hi Danny, I had a question about your quick intro to Python lesson sheet
> you gave out back in March 2001.
Hi tpc,
Some things will never die. *grin*
> The last page talks about formulating a regular expression to handle
> URLs, and you have the following:
>
> myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')
Ok. Let's split that up using verbose notation:
###
import re

myre = re.compile(r'''http://      ## protocol
                      [\w\.-/]+    ## followed by a bunch of "word"
                                   ## characters
                      \.?          ## followed by an optional
                                   ## period.
                      (?!          ## Topped with a negative
                                   ## lookahead for
                        [\w.-/]    ## "word" character.
                      )''', re.VERBOSE)
###
The page:
http://www.python.org/doc/lib/re-syntax.html
has more details about some of the regular expression syntax. AMK has
written a nice regex HOWTO here:
http://www.amk.ca/python/howto/regex/regex.html
which you might find useful.
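As a quick sanity check, here's a small sketch showing that the compact and verbose spellings behave the same way. In verbose mode the regex engine ignores whitespace and '#' comments outside of character classes. The names "compact", "verbose" and "sample" are just mine for illustration:

```python
import re

compact = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

verbose = re.compile(r'''
    http://         # protocol
    [\w\.-/]+       # one or more characters from the class
    \.?             # an optional period
    (?![\w.-/])     # not followed by a character from the class
''', re.VERBOSE)

sample = 'http://www.hotmail.com'    # the url tpc tried first

# Both spellings should report the same matched text.
print(compact.search(sample).group())
print(verbose.search(sample).group())
```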
> I understand \w stands for any word character, and \. means escaped
> period, and ? means zero or one instances of a character or set. I did
> a myre.search('http://www.hotmail.com') which yielded a result, but I am
> confused as to why
>
> myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>
> would work, since there is a '=' and you don't provide for one in the
> regular expression.
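Very true.  The regex doesn't have to swallow the whole string, though:
search() is happy as soon as the pattern matches some piece of it.  The
negative lookahead at the end only checks that the match isn't followed
by another URL-ish character, so that the regex can pick an url out of
running text like:
"This is an url: http://python.org. Isn't that neat?"
The negative lookahead should succeed right here, at the space after the
period:
"This is an url: http://python.org. Isn't that neat?"
                                   ^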
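If you want to see for yourself where the match stops on that sentence, something like the following sketch should do it. The variable names are only for illustration:

```python
import re

myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')

text = "This is an url: http://python.org. Isn't that neat?"
match = myre.search(text)

# The text the regex consumed.
print(match.group())

# The character sitting right after the match; this is the one the
# negative lookahead inspected before letting the match succeed.
print(repr(text[match.end()]))
```

Note that the trailing period ends up inside the match, because '.' is itself one of the characters in the class.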
In your url above, the negative lookahead should actually hit the question
mark first before it sees '='. That regex was a sloppy example; I should
have been more careful with it, but I was in a hurry when I wrote that
intro... *grin*
If you're in the Berkeley area, by the way, you might want to see if
Ka-Ping Yee is planning another CS 198 class in the future:
http://zesty.ca/bc/info.html
Anyway, we can experiment with this more easily by introducing a group
into the regular expression:
###
myre = re.compile(r'''
    (                  ## group 1
      http://          ## protocol
      [\w\.-/]+        ## followed by a bunch of "word"
                       ## characters
      \.?              ## followed by an optional
                       ## period.
    )                  ## end group
    (?!                ## Topped with a negative
                       ## lookahead for
      [\w.-/]          ## "word" character.
    )''', re.VERBOSE)
###
Let's check it now:
###
>>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
>>> match.group(1)
'http://www.sfgate.com/cgi'
###
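Oiii!  The regular expression is broken.  What has happened is that I've
incorrectly placed the hyphen in the character class.  Between two
characters, a hyphen defines a range (like "[a-z]" or "[0-9]"), so the
".-/" in
[\w\.-/]+
means "every character from '.' to '/'", which is just those two
characters; a literal '-' isn't in the class at all, and the match stops
at the hyphen in "cgi-bin".  I should have written:
[\w./-]+
instead: a hyphen at the very end of a character class is taken
literally.  The same fix applies to the class inside the lookahead.  You
can then add the other characters that show up in URLs (like '?' and
'=') to the character class, and then it should correctly match the
sfgate url.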
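Here is a rough sketch of that repair, to make the difference concrete. The "fixed" pattern just moves the hyphen to the end of both classes. The "wider" pattern is only my guess at which extra characters to allow ('?', '=', '&' and '%'), not a complete URL matcher:

```python
import re

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Between two characters, '-' makes a range: '.' through '/' only.
print(re.findall(r'[\.-/]', './-'))    # ['.', '/']
# With the hyphen last, all three characters are literal.
print(re.findall(r'[./-]', './-'))     # ['.', '/', '-']

# Hyphen moved to the end of both character classes.
fixed = re.compile(r'(http://[\w./-]+\.?)(?![\w./-])')
print(fixed.search(url).group(1))

# Assumption: these extra query-string characters are picked for
# illustration only; real URLs can contain more.
wider = re.compile(r'(http://[\w./?=&%-]+\.?)(?![\w./?=&%-])')
print(wider.search(url).group(1) == url)
```

With the hyphen fixed, the match should now run up to the '?', and the wider class should take in the whole url.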
I hope this helps!