Repeating assertions in regular expression

Devin Jeanpierre jeanpierreda at gmail.com
Tue Jan 3 14:36:00 EST 2012


> Put simply, it doesn't occur often enough to be worth it. The cost
> outweighs the potential benefit.

I don't buy it. You could backtrack instead of failing for \b+ and
\b*, and it would be almost as fast as this optimization.

-- Devin

On Tue, Jan 3, 2012 at 1:57 PM, MRAB <python at mrabarnett.plus.com> wrote:
> On 03/01/2012 09:45, Devin Jeanpierre wrote:
>>>
>>>  \\b\\b and \\b{2} aren't equivalent ?
>>
>>
>> This sounds suspiciously like a bug!
>>
>>>  Why is the wording "should never"? Repeating a zero-width assertion
>>>  is not forbidden, for instance:
>>>
>>>>>>  import re
>>>>>>  re.compile("\\b\\b\w+\\b\\b")
>>>
>>>  <_sre.SRE_Pattern object at 0xb7831140>
>>>>>>
>>>>>>
>>
>> I believe this is meant to refer to arbitrary-length repetitions, such
>> as r'\b*', not simple concatenations like that. r'\b*' will abort the
>> whole match if it is run on a boundary, because Python detects a
>> repetition of a zero-width match and decides this is an error.
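How a repeated zero-width assertion is treated may differ between Python
versions, so it is easiest to check on the interpreter at hand. A minimal
sketch, assuming only the standard `re` module; `probe` is a hypothetical
helper written for this illustration, and no particular result is claimed
for the repeated forms:

```python
import re

def probe(pattern, text):
    """Report how this interpreter's re module treats `pattern` on `text`."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Some versions may reject the pattern at compile time.
        return "compile error: %s" % exc
    match = compiled.search(text)
    if match is None:
        return "no match"
    return "matched %r at %d" % (match.group(), match.start())

# Plain, concatenated, and repeated forms of the word-boundary assertion.
for pattern in (r"\b", r"\b\b", r"\b*", r"(?:\b)*"):
    print("%-10s %s" % (pattern, probe(pattern, "spam eggs")))
```

The first two rows should agree with each other; the last two show what the
running version does with the repetition.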
>>
> r"\b+" can be optimised to r"\b", but r"\b*" can be optimised to r"".
> r"\b\b", r"\b\b\b", etc, can be optimised to r"\b".
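The concatenation case above can be checked by hand. A minimal sketch, not
what the `re` compiler does: `collapse_boundaries` is a hypothetical helper
that rewrites the pattern text, and it deliberately ignores character
classes and escaped backslashes.

```python
import re

def collapse_boundaries(pattern):
    # Hypothetical helper: replace each run of two or more \b with one \b.
    # Purely textual, so it does not understand [...] or an escaped backslash.
    return re.sub(r"(?:\\b){2,}", lambda m: r"\b", pattern)

original = r"\b\b\w+\b\b"
collapsed = collapse_boundaries(original)
text = "spam, eggs and ham"

print(collapsed)
# Both patterns should find the same words.
print(re.findall(original, text) == re.findall(collapsed, text))
```

Even this naive version has to scan the whole pattern, which is the cost
being weighed against how rarely a repeated r"\b" shows up.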
>
> So why isn't it optimised?
>
> Because every potential optimisation has a cost, which is the time it
> would take to look for it.
>
> That cost needs to be balanced against the potential benefit.
>
> How often do you see repeated r"\b"?
>
> Put simply, it doesn't occur often enough to be worth it. The cost
> outweighs the potential benefit.