[Repost] Re: [Tutor] newbie re question

tpc@csua.berkeley.edu tpc@csua.berkeley.edu
Thu Jul 10 10:45:02 2003


hi Danny, I sent this to the list two days ago and to you yesterday and
wasn't sure if you received it.

---------- Forwarded message ----------
Date: Wed, 9 Jul 2003 07:33:03 -0700 (PDT)
From: tpc@csua.berkeley.edu
To: dyoo@hkn.eecs.berkeley.edu
Subject: Re: [Tutor] newbie re question (fwd)

hi Danny, I sent this to the list yesterday and wasn't sure if you
received it.

---------- Forwarded message ----------
Date: Tue, 8 Jul 2003 10:41:44 -0700 (PDT)
From: tpc@csua.berkeley.edu
To: tutor@python.org
Subject: Re: [Tutor] newbie re question


hi Danny,

ah yes, I have seen Ping at various parties (and wearing a PythonLabs
shirt no less!).  But I digress.  I am still confused why you provided for
a negative lookahead.  I looked at amk's definition of a negative lookahead,
and it seems to say the regex will not match if the negative lookahead
condition is met.  So:

>>> testsearch = re.compile('tetsuro(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('hitetsuroone')
<_sre.SRE_Match object at 0x860e4a0>

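As a sketch, the same check written as a script shows one more detail: the
lookahead only inspects what follows and consumes nothing, so the match
ends right after 'tetsuro'.  The variable m is just a name picked for the
illustration.

```python
import re

# Same pattern as in the interactive session above.
testsearch = re.compile('tetsuro(?!hello)', re.IGNORECASE)

# 'hello' follows 'tetsuro', so the negative lookahead rejects this position.
print(testsearch.search('tetsurohello'))   # None

# 'one' follows, so the lookahead is satisfied. The lookahead itself
# consumes no characters: the match covers only 'tetsuro'.
m = testsearch.search('hitetsuroone')
print(m.group(), m.span())                 # tetsuro (2, 9)
```
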
Now in the case of:

>>> myre = re.compile(r'http://[\w\.-]+\.?(?![\w.-/])')

you are looking for 'http://', then one or more word characters, periods
and hyphens, then an optional period, and then a negative lookahead of a
word character, any character, a hyphen and a forward slash.  Granted,
your regex may have been a sloppy example, and you might have meant a
negative lookahead of a word character, a period, a hyphen and a forward
slash.  I still do not understand why you provided for one.  And if you had
a good reason, why would the sfgate url match at all, since it clearly
has a word character, period, or hyphen following the set of characters
you were allowing for, including the optional period?  Here is an example
of something similar that perplexes me:

>>> testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
>>> testsearch.search('tetsurohello')
>>> testsearch.search('tetsuro.hello')
<_sre.SRE_Match object at 0x8612028>
>>> match = testsearch.search('tetsuro.hello')
>>> match.group()
'tetsuro'
>>> match = testsearch.search('tetsuro..hello')
>>> match.group()
'tetsuro.'

Why wasn't the first period caught?
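
One possible reading, offered as a sketch and not a definitive answer: the
optional period is greedy but allowed to backtrack.  In 'tetsuro.hello' the
engine first lets \.? take the period, the lookahead then sees 'hello' and
fails, so the engine retries with \.? matching nothing.  At that point the
text ahead is '.hello', which does not start with 'hello', so the lookahead
passes and the match is 'tetsuro'.  In 'tetsuro..hello' the lookahead
already passes after the first period, because '.hello' follows.  The
helper show() below is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as in the session above.
testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)

def show(text):
    """Hypothetical helper: return the matched text, or None if no match."""
    m = testsearch.search(text)
    return m.group() if m else None

# No period to give back, and 'hello' follows directly: no match.
# One period: \.? gives it back so the lookahead sees '.hello' and passes.
# Two periods: the lookahead passes right after the first period.
results = {text: show(text)
           for text in ('tetsurohello', 'tetsuro.hello', 'tetsuro..hello')}
print(results)
# {'tetsurohello': None, 'tetsuro.hello': 'tetsuro', 'tetsuro..hello': 'tetsuro.'}
```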

On Mon, 30 Jun 2003, Danny Yoo wrote:

>
>
> On Mon, 30 Jun 2003 tpc@csua.berkeley.edu wrote:
>
> > hi Danny, I had a question about your quick intro to Python lesson sheet
> > you gave out back in March 2001.
>
> Hi tpc,
>
>
>
> Some things will never die.  *grin*
>
>
>
> > The last page talks about formulating a regular expression to handle
> > URLs, and you have the following:
> >
> > myre = re.compile(r'http://[\w\.-/]+\.?(?![\w.-/])')
>
>
>
> Ok.  Let's split that up using verbose notation:
>
>
> ###
> myre = re.compile(r'''http://            ## protocol
>                       [\w\.-/]+          ## followed by a bunch of "word"
>                                          ## characters
>
>                       \.?                ## followed by an optional
>                                          ## period.
>
>                       (?!                ## Topped with a negative
>                                          ## lookahead for
>                             [\w.-/]      ## "word" character.
>
>                       )''', re.VERBOSE)
> ###
>
>
> The page:
>
>     http://www.python.org/doc/lib/re-syntax.html
>
> has more details about some of the regular expression syntax.  AMK has
> written a nice regex HOWTO here:
>
>     http://www.amk.ca/python/howto/regex/regex.html
>
> which you might find useful.
>
>
>
>
> > I understand \w stands for any word character, and \. means escaped
> > period, and ? means zero or one instances of a character or set.  I did
> > a myre.search('http://www.hotmail.com') which yielded a result, but I am
> > confused as to why
> >
> > myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
> >
> > would work, since there is a '=' and you don't provide for one in the
> > regular expression.
>
>
> Very true.  It should, however, match against the negative lookahead ---
> the regex tries to look ahead to see that it can match something like:
>
>     "This is an url: http://python.org.  Isn't that neat?"
>
>
> The negative lookahead should match right here:
>
>     "This is an url: http://python.org.  Isn't that neat?"
>                                        ^
>
> In your url above, the negative lookahead should actually hit the question
> mark first before it sees '='.  That regex was a sloppy example; I should
> have been more careful with it, but I was in a hurry when I wrote that
> intro...  *grin*
>
>
>
> If you're in the Berkeley area, by the way, you might want to see if
> Ka-Ping Yee is planning another CS 198 class in the future:
>
>     http://zesty.ca/bc/info.html
>
>
>
>
>
> Anyway, we can experiment with this more easily by introducing a group
> into the regular expression:
>
> ###
> myre = re.compile(r'''
>                     (                  ## group 1
>
>                       http://            ## protocol
>                       [\w\.-/]+          ## followed by a bunch of "word"
>                                          ## characters
>
>                       \.?                ## followed by an optional
>                                          ## period.
>
>                     )                  ## end group
>
>
>                       (?!                ## Topped with a negative
>                                          ## lookahead for
>                             [\w.-/]      ## "word" character.
>
>                       )''', re.VERBOSE)
> ###
>
>
>
> Let's check it now:
>
> ###
> >>> match = myre.search('http://www.sfgate.com/cgi-bin/article.cgi?f=/gate/archive/2003/06/29/gavin29.DTL')
> >>> match.group(1)
> 'http://www.sfgate.com/cgi'
> ###
>
>
>
> Oiii!  The regular expression is broken.  What has happened is that I've
> incorrectly placed the hyphen in the character class.  That is, instead
> of
>
>
>     [\w\.-/]+
>
>
> I should have done:
>
>     [\w./-]+
>
>
> to keep the regex engine from treating the hyphen as a span of
> characters (like "[a-z]", or "[0-9]").  In the broken version, ".-/" is
> read as the range from "." to "/", which leaves out the hyphen itself.
> You can then introduce the other characters into the "word" character
> class, and then it should correctly match the sfgate url.
>
>
>
> I hope this helps!
>
>
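
The hyphen problem described above can be reproduced in isolation.  This is
a minimal sketch which assumes it is fine to drop the optional period and
the lookahead, so that only the character class differs between the two
patterns.  The names url, broken and fixed are made up for the
illustration.

```python
import re

url = ('http://www.sfgate.com/cgi-bin/article.cgi'
       '?f=/gate/archive/2003/06/29/gavin29.DTL')

# Here '-' sits between '.' and '/', so it is read as the range
# '.' through '/' and a literal hyphen is NOT part of the class.
broken = re.compile(r'http://[\w\.-/]+')

# With the hyphen placed last it is a literal character.
fixed = re.compile(r'http://[\w./-]+')

print(broken.search(url).group())   # http://www.sfgate.com/cgi
print(fixed.search(url).group())    # http://www.sfgate.com/cgi-bin/article.cgi
```

Both patterns still stop at the '?', since neither class provides for it.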