python replace/sub/wildcard/regex issue

Tue Jan 19 07:23:00 EST 2010

On Jan 18, 11:04 pm, tom <badoug... at gmail.com> wrote:
> hi...
>
> trying to figure out how to solve what should be an easy python/regex/
> wildcard/replace issue.
>
> i've tried a number of different approaches.. so i must be missing
> something...
>
> my initial sample text are:
>
> Soo Choi</span>LONGEDITBOX">Apryl Berney
> Soo Choi</span>LONGEDITBOX">Joel Franks
> Joel Franks</span>GEDITBOX">Alexander Yamato
>
> and i'm trying to get
>
> Soo Choi foo Apryl Berney
> Soo Choi foo Joel Franks
> Joel Franks foo Alexander Yamato
>
> the issue i'm facing.. is how to start at "</" and end at '">' and
> substitute inclusive of the stuff inside the regex...
>
> i've tried derivations of
>
> name=re.sub("</s[^>]*\">"," foo ",name)
>
> but i'm missing something...
>
> thoughts... thanks
>
> tom

The problem is in the middle of your pattern.  </s matches itself
correctly, but [^>]* consumes anything that is not > and stops at the
first >.  So [^>]* consumes "pan" in each case, and the regex then
tries to match \"> at that point.  That fails because the next
character is > rather than ", so the whole match fails.  It never
makes it to the second >.
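
To see this for yourself, here is a small sketch using one of your
sample lines.  The variable names (sample, original, partial) are just
ones I made up for illustration; the second pattern adds a capture
group only to show what [^>]* picks up.

```python
import re

sample = 'Soo Choi</span>LONGEDITBOX">Apryl Berney'

# The pattern from the question: after "</s", [^>]* can only take "pan",
# and the next character is ">" rather than the required '"'.
original = re.compile(r'</s[^>]*">')
print(original.search(sample))      # None

# Capture what [^>]* consumes right after "</s".
partial = re.search(r'</s([^>]*)', sample)
print(partial.group(1))             # pan
```

Since the pattern never matches, re.sub has nothing to replace and
returns the string unchanged.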

I agree with Chris Rebert: regexes are dangerous because it isn't
always clear which inputs a pattern will and won't match (see the
explanation above :).  Also, if you only have a few comparisons to
do, they can be inefficient.  However, for your limited set of
examples the following should work:

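import re

aList = ['Soo Choi</span>LONGEDITBOX">Apryl Berney',
        'Soo Choi</span>LONGEDITBOX">Joel Franks',
        'Joel Franks</span>GEDITBOX">Alexander Yamato']

# [\w\W] matches any character, and * is greedy, so this matches from
# the first < through the last > in each string.
matcher = re.compile(r"<[\w\W]*>")

# Replace the matched span in every sample with " foo ".
newList = []
for x in aList:
    newList.append(matcher.sub(" foo ", x))

print(newList)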
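
One caveat: because [\w\W]* is greedy, the pattern above runs to the
last > on the line.  If your real data can contain more tags later in
the line, a non-greedy pattern that starts at "</" and stops at the
first '">' may be closer to what you described.  This is only a
sketch; the names tight, greedy and extra are mine, and the extra
string is a made-up input, not one of your samples.

```python
import re

# Start at "</", take as little as possible, stop at the first '">'.
tight = re.compile(r'</.*?">')
# The greedy pattern from the example above, for comparison.
greedy = re.compile(r'<[\w\W]*>')

line = 'Soo Choi</span>LONGEDITBOX">Apryl Berney'
print(tight.sub(' foo ', line))     # Soo Choi foo Apryl Berney

# Made-up input with a second tag later on the line.
extra = 'Soo Choi</span>LONGEDITBOX">Apryl Berney<br>end'
print(greedy.sub(' foo ', extra))   # Soo Choi foo end
print(tight.sub(' foo ', extra))    # Soo Choi foo Apryl Berney<br>end
```

On your three sample lines both patterns give the same result, so
which one you want depends on what the rest of the data looks like.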

David