How to escape strings for re.finditer?
Thomas Passin
list1 at tompassin.net
Tue Feb 28 08:35:35 EST 2023
On 2/28/2023 4:33 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>> On 2/27/2023 9:16 PM, avi.e.gross at gmail.com wrote:
>>> And, just for fun, since there is nothing wrong with your code, this
>>> minor change is terser:
>>>
>>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>> ...     print(match.start(), match.end())
>>> ...
>>> 4 18
>>> 26 40
>>
>> Just for more fun :) -
>>
>> Without knowing how general your expressions will be, I think the
>> following version is very readable, certainly more readable than regexes:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> KEY = 'abc_degree + 1'
>>
>> for i in range(len(example)):
>>     if example[i:].startswith(KEY):
>>         print(i, i + len(KEY))
>> # prints:
>> 4 18
>> 26 40
> I think it's often a good idea to use a standard library function
> instead of rolling your own. The issue becomes less clear-cut when the
> standard library doesn't do exactly what you need (as here, where
> re.finditer() uses regular expressions while the use case only uses
> simple search strings). Ideally there would be a str.finditer() method
> we could use, but in the absence of that I think we still need to
> consider using the almost-but-not-quite fitting re.finditer().
>
> Two reasons:
>
> (1) I think it's clearer: the name tells us what it does (though of
> course we could solve this in a hand-written version by wrapping it in a
> suitably named function).
>
> (2) Searching for a string in another string, in a performant way, is
> not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it
> will. For better performance, string searching algorithms jump ahead
> either when they find a match or when they know for sure there isn't a
> match for some time (see e.g. the Boyer–Moore string-search algorithm).
> You could write such a more efficient algorithm, but then it becomes
> more complex and more error-prone. Using a well-tested existing function
> becomes quite attractive.
Sure, it all depends on what the real task will be. That's why I wrote
"Without knowing how general your expressions will be". For the example
string, it's unlikely that speed will be a factor, but who knows what
target strings and keys will turn up in the future?
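For what it's worth, plain str.find() can also drive an iterator without
touching regexes. Something like the sketch below (finditer_str is just a
name made up here, not an existing str method) should behave like the
startswith() loop above while letting the C-level substring search do the
scanning:

def finditer_str(key, text):
    """Yield (start, end) for every occurrence of key in text."""
    if not key:
        return  # an empty key would loop forever
    start = text.find(key)
    while start != -1:
        yield start, start + len(key)
        start = text.find(key, start + len(key))

With the example string from above it prints the same spans:

for start, end in finditer_str('abc_degree + 1', example):
    print(start, end)
# prints:
# 4 18
# 26 40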
> To illustrate the performance difference, I did a simple test (using the
> paragraph above as test text):
>
> import re
> import timeit
>
> def using_re_finditer(key, text):
>     matches = []
>     for match in re.finditer(re.escape(key), text):
>         matches.append((match.start(), match.end()))
>     return matches
>
>
> def using_simple_loop(key, text):
>     matches = []
>     for i in range(len(text)):
>         if text[i:].startswith(key):
>             matches.append((i, i + len(key)))
>     return matches
>
>
> CORPUS = """Searching for a string in another string, in a performant way,
> is not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it will.
> For better performance, string searching algorithms jump ahead either when
> they found a match or when they know for sure there isn't a match for some
> time (see e.g. the Boyer–Moore string-search algorithm). You could write
> such a more efficient algorithm, but then it becomes more complex and more
> error-prone. Using a well-tested existing function becomes quite attractive."""
> KEY = 'in'
> print('using_simple_loop:',
>       timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)',
>                     globals=globals(), number=1000))
> print('using_re_finditer:',
>       timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)',
>                     globals=globals(), number=1000))
>
> This does 5 runs of 1000 repetitions each, and reports the time in
> seconds for each of those runs.
> Result on my machine:
>
> using_simple_loop: [0.13952950000020792, 0.13063130000000456,
> 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
> using_re_finditer: [0.003861400000005233, 0.004061900000124297,
> 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>
> We find that in this test re.finditer() is more than 30 times faster
> (despite the overhead of regular expressions).
>
> While speed isn't everything in programming, with such a large
> difference in performance and (to me) no real disadvantages of using
> re.finditer(), I would prefer re.finditer() over writing my own.
>
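
If anyone wants to see how a regex-free version stacks up, the find()-based
sketch above can be dropped into the same timeit harness as a third variant
(untested here, so no numbers claimed):

def using_str_find(key, text):
    # Same (start, end) tuples as the other two functions, built with str.find().
    matches = []
    start = text.find(key)
    while start != -1:
        matches.append((start, start + len(key)))
        start = text.find(key, start + len(key))
    return matches

print('using_str_find:',
      timeit.repeat(stmt='using_str_find(KEY, CORPUS)',
                    globals=globals(), number=1000))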