[Tutor] alternation regex--greedy?
Kent Johnson
kent37 at tds.net
Fri Nov 4 19:38:35 CET 2005
Chris or Leslie Smith wrote:
> I found something unexpected about regex patterns
> that use alternation: the match is not greedy. The result depends on
> which pattern is listed first.
Yes, this is exactly as described in the docs for "|":
As the target string is scanned, REs separated by "|" are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the "|" operator is never greedy.
>
> ###
>
>>>> import re pat=re.compile('this|th') s='thistle'
>>>> pat.match(s).group(0)
>
> 'this'
>
>>>> pat=re.compile('th|this') pat.match(s).group(0)
>
> 'th' ###
>
>
> I read the regex docs about the 'match' function and they say that
> the 'leftmost non-overlapping occurrences of pattern in string' will
> be returned. I can understand why the pattern 'light|sli' (or
> 'sli|light') will match 'sli' in text 'slight' (because 'sli' is more
> to the left than 'light'). But why isn't the greedy result of 'this'
> obtained for either of the patterns given in the example above
> (between the ###)? Is this because it is trying to return a pattern
> that overlaps as little as possible with anything else?
The first pattern does return 'this'.
> Basically what this is meaning for me so far is that a single-pass
> solution to replacing certain text sequences with others cannot
> always be done unless the patterns to be found are non-overlapping
> themselves. In the case of 'light|sli' even though 'light' appears
> first in the alternation, 'sli' will be found with higher priority in
> sequences that had 'slight' in them since 'sli' is more to the left.
If you want to replace 'light' rather than 'sli' in 'slight' then a multi-pass replace is the simplest solution. You could make a pattern that will do this but it won't scale well to many replacements.
You can probably divide your replacements into a few groups such that none of the patterns in group A can overlap, and they all must be done before the replacements in group B, etc. Then apply the replacements in groups rather than each separately.
Kent
--
http://www.kentsjohnson.com
More information about the Tutor
mailing list