Something confusing about non-greedy reg exp match

Sun Sep 6 23:26:14 EDT 2009

On Sep 6, 10:06 pm, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> <gburde... at gmail.com> wrote in message
>
> news:f98a6057-c35f-4843-9efb-7f36b05b677c at g19g2000yqo.googlegroups.com...
>
> > If I do this:
>
> > import re
> > a=re.search(r'hello.*?money',  'hello how are you hello funny money')
>
> > I would expect a.group(0) to be "hello funny money", since .*? is a
> > non-greedy match. But instead, I get the whole sentence, "hello how
> > are you hello funny money".
>
> > Is this expected behavior? How can I specify the correct regexp so
> > that I get "hello funny money" ?
>
> A non-greedy match matches the fewest characters before matching the text
> *after* the non-greedy match.  For example:
>
> >>> import re
> >>> a=re.search(r'hello.*?money','hello how are you hello funny money and more money')
> >>> a.group(0)  # non-greedy stops at the first money
> 'hello how are you hello funny money'
> >>> a=re.search(r'hello.*money','hello how are you hello funny money and more money')
> >>> a.group(0)  # greedy keeps going to the last money
> 'hello how are you hello funny money and more money'
>
> This is why it is difficult to use regular expressions to match nested
> objects like parentheses or XML tags.  In your case you'll need something
> extra to not match the first hello.
>
> >>> a=re.search(r'(?<!^)hello.*?money','hello how are you hello funny money')
> >>> a.group(0)
> 'hello funny money'
>
> -Mark

I see now. I also understand r's response. But what if there are many
occurrences of "hello" before "money", and I don't know how many there
are? In other words, I want to find every occurrence of "money", and
for each occurrence, I want to scan in the reverse (left) direction to
the closest occurrence of "hello". How can this be done?
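
One possible approach, offered only as a rough sketch: forbid another
"hello" from appearing between the starting "hello" and "money" by
putting a negative lookahead in front of every character consumed in
between. The function name below is made up for illustration, and the
sketch assumes that each "money" of interest is preceded by its own
"hello", since re.findall only returns non-overlapping matches.

```python
import re

# Hypothetical helper name, not from the thread.
# (?:(?!hello).)*? consumes one character at a time, but only while
# that character does not begin another "hello". A match therefore has
# to start at the "hello" closest to the "money" that ends it.
NEAREST = re.compile(r'hello(?:(?!hello).)*?money')

def nearest_hello_to_money(text):
    """Return every non-overlapping 'hello ... money' span in which
    no other 'hello' appears between the two words."""
    return NEAREST.findall(text)

single = nearest_hello_to_money('hello how are you hello funny money')
several = nearest_hello_to_money('hello a hello b money hello c money')
```

With the sentence from the original question, single should hold only
'hello funny money'. Note that "." does not match a newline unless
re.DOTALL is passed, so text spanning several lines would need that flag.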