re.sub(): replace longest match instead of leftmost match?

Ian Kelly ian.g.kelly at gmail.com
Mon Dec 19 21:58:11 EST 2011


On Mon, Dec 19, 2011 at 4:15 PM,  <ting at thsu.org> wrote:
> On Dec 16, 11:49 am, John Gordon <gor... at panix.com> wrote:
>> I'm working with IPv6 CIDR strings, and I want to replace the longest
>> match of "(0000:|0000$)+" with ":".  But when I use re.sub() it replaces
>> the leftmost match, even if there is a longer match later in the string.
>
> Typically this means that your regular expression is not specific
> enough.
>
> That is, if you get multiple matches, and you need to sort through
> those matches before performing a replace, it usually means that you
> should rewrite your expression to get a single match.
>
> Invariably this happens when you try to take short cuts. I can't blame
> you for using a short cut, as sometimes short cuts just work, but once
> you find that your short cut fails, you need to step back and rethink
> the problem, rather than try to hack your short cut.

The problem isn't short cuts.  To narrow down multiple matches to a
single longest match here, two additional criteria need to be met:
there must be no other match anywhere in the search string that is
longer than the match being considered, and there must be no match of
equal length preceding the match being considered.

Note that in the general case, the language is not regular and it
would not even be possible to get a single longest match using a
regular expression.  For IPv6, it is possible only because IPv6
addresses are bounded in length.  In English, that regular expression
would be constructed like this:

Any block of four or more groups of zeroes OR
Zero or more blocks containing (zero to two groups of zero followed by
a non-zero group) followed by a block of three or more groups of
zeroes followed by zero or more blocks containing (a non-zero group
followed by zero to three groups of zeroes) OR
Zero or more blocks containing (an optional zero group followed by a
non-zero group) followed by a block of two or more groups of zeroes
followed by zero or more blocks containing (a non-zero group followed
by zero to to two groups of zeroes) OR
Zero or more non-zero groups followed by a group of zeroes followed by
zero or more blocks containing (a non-zero group followed by an
optional zero group).

If anyone wants to give a crack at translating that to an actual
regular expression, I'd be interested in seeing it.  The added
complexity is so great, though, that I for one would vastly prefer the
simple solution already proposed of getting all the matches and
iterating to find the longest.

Cheers,
Ian



More information about the Python-list mailing list