How to escape strings for re.finditer?
Thomas Passin
list1 at tompassin.net
Tue Feb 28 08:35:35 EST 2023
On 2/28/2023 4:33 AM, Roel Schroeven wrote:
> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>> On 2/27/2023 9:16 PM, avi.e.gross at gmail.com wrote:
>>> And, just for fun, since there is nothing wrong with your code, this
>>> minor change is terser:
>>>
>>>>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>>>> for match in re.finditer(re.escape('abc_degree + 1') , example):
>>> ...     print(match.start(), match.end())
>>> ...
>>> 4 18
>>> 26 40
>>
>> Just for more fun :) -
>>
>> Without knowing how general your expressions will be, I think the
>> following version is very readable, certainly more readable than regexes:
>>
>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> KEY = 'abc_degree + 1'
>>
>> for i in range(len(example)):
>>     if example[i:].startswith(KEY):
>>         print(i, i + len(KEY))
>> # prints:
>> 4 18
>> 26 40
> I think it's often a good idea to use a standard library function
> instead of rolling your own. The issue becomes less clear-cut when the
> standard library doesn't do exactly what you need (as here, where
> re.finditer() uses regular expressions while the use case only uses
> simple search strings). Ideally there would be a str.finditer() method
> we could use, but in the absence of that I think we still need to
> consider using the almost-but-not-quite fitting re.finditer().
>
> Two reasons:
>
> (1) I think it's clearer: the name tells us what it does (though of
> course we could solve this in a hand-written version by wrapping it in a
> suitably named function).
>
> (2) Searching for a string in another string, in a performant way, is
> not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it
> will. For better performance, string searching algorithms jump ahead
> either when they find a match or when they know for sure there isn't a
> match for some time (see e.g. the Boyer–Moore string-search algorithm).
> You could write such a more efficient algorithm, but then it becomes
> more complex and more error-prone. Using a well-tested existing function
> becomes quite attractive.
Sure, it all depends on what the real task will be. That's why I wrote
"Without knowing how general your expressions will be". For the example
string, it's unlikely that speed will be a factor, but who knows what
target strings and keys will turn up in the future?
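For what it's worth, plain str.find() can also drive an iterator without
touching regexes. Something like the sketch below (finditer_str is just a
name made up here, not an existing str method) should behave like the
startswith() loop above while letting the C-level substring search do the
scanning:

def finditer_str(key, text):
    """Yield (start, end) for every occurrence of key in text."""
    if not key:
        return  # an empty key would loop forever
    start = text.find(key)
    while start != -1:
        yield start, start + len(key)
        start = text.find(key, start + len(key))

With the example string from above it prints the same spans:

for start, end in finditer_str('abc_degree + 1', example):
    print(start, end)
# prints:
# 4 18
# 26 40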
> To illustrate the performance difference, I did a simple test (using the
> paragraph above as test text):
>
> import re
> import timeit
>
> def using_re_finditer(key, text):
>     matches = []
>     for match in re.finditer(re.escape(key), text):
>         matches.append((match.start(), match.end()))
>     return matches
>
>
> def using_simple_loop(key, text):
>     matches = []
>     for i in range(len(text)):
>         if text[i:].startswith(key):
>             matches.append((i, i + len(key)))
>     return matches
>
>
> CORPUS = """Searching for a string in another string, in a performant way,
> is not as simple as it first appears. Your version works correctly, but
> slowly. In some situations it doesn't matter, but in other cases it will.
> For better performance, string searching algorithms jump ahead either when
> they found a match or when they know for sure there isn't a match for some
> time (see e.g. the Boyer–Moore string-search algorithm). You could write
> such a more efficient algorithm, but then it becomes more complex and more
> error-prone. Using a well-tested existing function becomes quite attractive."""
> KEY = 'in'
> print('using_simple_loop:',
>       timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)',
>                     globals=globals(), number=1000))
> print('using_re_finditer:',
>       timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)',
>                     globals=globals(), number=1000))
>
> This does 5 runs of 1000 repetitions each, and reports the time in
> seconds for each of those runs.
> Result on my machine:
>
> using_simple_loop: [0.13952950000020792, 0.13063130000000456,
> 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
> using_re_finditer: [0.003861400000005233, 0.004061900000124297,
> 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>
> We find that in this test re.finditer() is more than 30 times faster
> (despite the overhead of regular expressions).
>
> While speed isn't everything in programming, with such a large
> difference in performance and (to me) no real disadvantages of using
> re.finditer(), I would prefer re.finditer() over writing my own.
>
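
If anyone wants to see how a regex-free version stacks up, the find()-based
sketch above can be dropped into the same timeit harness as a third variant
(untested here, so no numbers claimed):

def using_str_find(key, text):
    # Same (start, end) tuples as the other two functions, built with str.find().
    matches = []
    start = text.find(key)
    while start != -1:
        matches.append((start, start + len(key)))
        start = text.find(key, start + len(key))
    return matches

print('using_str_find:',
      timeit.repeat(stmt='using_str_find(KEY, CORPUS)',
                    globals=globals(), number=1000))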