[Tutor] Newbie - regex question

Mon Aug 30 23:56:34 CEST 2010

On Mon, Aug 30, 2010 at 10:52 PM, Sam M <sm9191 at gmail.com> wrote:
> Hi Guys,
>
> I'd like remove contents between tags <email> that matches pattern "WORD1"
> as follows:
>
> Change
> "stuff <email>WORD1-EMAILID at DOMAIN.COM</email> more stuff
> <email>WORD1-EMAILID at DOMAIN.COM</email> still more stuff
> <email>WORD2-EMAILID at DOMAIN.COM</email> stuff after WORD2
> <email>WORD1-EMAILID at DOMAIN.COM</email>"
>
> To
> "stuff  more stuff  still more stuff <email>WORD2-EMAILID at DOMAIN.COM</email>
> stuff after WORD2 "
>
> The following did not work
> newl = re.sub (r'<email>WORD1-.*</email>',"",line)
>

This precise problem is actually described in the re documentation on
python.org:

http://docs.python.org/howto/regex.html#greedy-versus-non-greedy

In short: .* is greedy and gobbles up as much as it can. That means
</email> will resolve to the last </email> tag in the line, and all
the previous ones are simply eaten by .*
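
To see this for yourself, here is a minimal sketch that runs your
original pattern against the sample text from your mail. The names
line and greedy are just mine for illustration, and I have joined your
sample onto a single line.

```python
import re

# Sample text from the question, joined into one line.
line = ("stuff <email>WORD1-EMAILID at DOMAIN.COM</email> more stuff "
        "<email>WORD1-EMAILID at DOMAIN.COM</email> still more stuff "
        "<email>WORD2-EMAILID at DOMAIN.COM</email> stuff after WORD2 "
        "<email>WORD1-EMAILID at DOMAIN.COM</email>")

# Greedy .* runs from the first <email>WORD1- all the way to the
# last </email> in the line, so the WORD2 entry is swallowed as well.
greedy = re.sub(r'<email>WORD1-.*</email>', "", line)
print(repr(greedy))  # 'stuff '
```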

To solve this, we have the non-greedy quantifiers. They eat not as
much as possible, but as little as possible. To make a quantifier
non-greedy, simply add a question mark at its end:

r'<email>WORD1-.*?</email>'
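
As a quick check, here is a small sketch applying that pattern to the
sample text from your mail (again joined onto one line; the names line
and newl are only for illustration):

```python
import re

# Sample text from the question, joined into one line.
line = ("stuff <email>WORD1-EMAILID at DOMAIN.COM</email> more stuff "
        "<email>WORD1-EMAILID at DOMAIN.COM</email> still more stuff "
        "<email>WORD2-EMAILID at DOMAIN.COM</email> stuff after WORD2 "
        "<email>WORD1-EMAILID at DOMAIN.COM</email>")

# Non-greedy .*? stops at the nearest </email>, so each WORD1 entry
# is removed on its own and the WORD2 entry is left alone.
newl = re.sub(r'<email>WORD1-.*?</email>', "", line)
print(repr(newl))
```

Note that the spaces on either side of each removed tag stay behind,
so you get double spaces, just like in your expected output.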

Hugo