doubling the number of tests, but not taking twice as long
Cameron Simpson
cs at cskk.id.au
Wed Jul 18 19:05:40 EDT 2018
On 18Jul2018 17:40, Larry Martell <larry.martell at gmail.com> wrote:
>On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti <neilc at norwich.edu> wrote:
>> On 2018-07-16, Larry Martell <larry.martell at gmail.com> wrote:
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?
As others have mentioned, your stuff*() functions must be doing very little
work, because I'd expect the regexp matching to be fairly quick.
>Yeah, that was my first thought, but I haven't been able to come up
>with a regex that works.
>
>There are 4 cases I need to detect:
>
>case1 = 'spam_M123_eggs_MeasDisplay_sausage'
>case2 = 'spam_M123_eggs_sausage_and_spam'
>case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
>case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
>I thought this regex would work:
>
>'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
Did you try making that a raw string:
r'(......}'
to avoid mangling the backslashes (which Python will interpret before they get
to the regexp parser)?
Print meas_regex to check it got past Python intact. Just print(meas_regex).
Also, "{0,1}" is usually written "?".
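A quick sketch of both points, using only the standard re module. The "\b" example is my own choice for illustration (it is an escape that Python does translate in a normal string), and the sample name is made up along the lines of the cases above:

```python
import re

# In a normal string Python turns "\b" into a single backspace character
# before the regexp parser ever sees it; a raw string keeps both characters.
plain = '\bspam\b'
raw = r'\bspam\b'
print(len(plain), len(raw))

# "{0,1}" and "?" both mean "zero or one occurrence".
long_form = re.compile(r'(_M\d+_){0,1}')
short_form = re.compile(r'(_M\d+_)?')
name = '_M123_eggs'   # hypothetical sample name
same = long_form.match(name).group(1) == short_form.match(name).group(1)
print(same)
```

Printing the pattern, as suggested above, is the simplest way to see which form actually reached re.compile().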
>And then I could look at the match objects and see which of the 4
>cases it was. But try as I might, I could not get it to work. Any
>regex gurus want to tell me what I am doing wrong here?
Backslashes aside, it looks ok to me. So I'd better run it... Code:
from __future__ import print_function
import re

# The four sample filenames from the question.
case1 = 'spam_M123_eggs_MeasDisplay_sausage'
case2 = 'spam_M123_eggs_sausage_and_spam'
case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
case4 = 'spam_spam_spam_eggs_sausage_and_spam'

# Raw string, so the backslash reaches the regexp parser untouched.
meas_regex = r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
print("meas_regex =", meas_regex)
meas_re = re.compile(meas_regex)

# Report what each group captured for each case.
for case in case1, case2, case3, case4:
    print(case, end=" ")
    m = meas_re.search(case)
    if m:
        print("MATCH: group1 =", m.group(1), "group2 =", m.group(2))
    else:
        print("NO MATCH")
Output:
meas_regex = (_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}
spam_M123_eggs_MeasDisplay_sausage MATCH: group1 = None group2 = None
spam_M123_eggs_sausage_and_spam MATCH: group1 = None group2 = None
spam_spam_spam_PatternFov_eggs_sausage_and_spam MATCH: group1 = None group2 = None
spam_spam_spam_eggs_sausage_and_spam MATCH: group1 = None group2 = None
Ah, and there's the problem. Though I'm surprised to get the Nones in the
.group()s instead of the empty string; possibly that reflects "0 occurrences".
[...] A little testing with other tweaks to the regexp supports that. No
matter. To your problem:
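When you write "(_M\d+_){0,1}" or anything else that is optional like that, it
can match the empty string (the "0"), and the empty string _always_ matches.
Likewise the second part of the pattern, and likewise the ".*?" in between. So
search() succeeds at the very start of the filename with an empty match and
never looks any further.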
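To make that concrete, here is a small sketch that looks at where the match lands. It reuses the pattern and the first sample name from above; the one-letter patterns at the end are just my own minimal illustration of the None versus empty-string difference:

```python
import re

meas_re = re.compile(r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}')
m = meas_re.search('spam_M123_eggs_MeasDisplay_sausage')

# The search succeeds at the very first position with an empty match,
# so neither group ever takes part.
print(m.span())           # (0, 0)
print(repr(m.group(0)))   # ''
print(m.groups())         # (None, None)

# A group that did not take part reports None; a group that took part
# but matched nothing reports the empty string.
skipped = re.search(r'(a)?', 'b').group(1)
empty = re.search(r'(a?)', 'b').group(1)
print(skipped, repr(empty))   # None ''
```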
Because you want to know about _both_ the "_M\d+_" _and_ the
"MeasDisplay|PatternFOV" you can't put them both in the same pattern like this:
if you make them optional, the pattern always matches the empty string even if
the target is later on; if you make them mandatory (no "{0,1}") your pattern
will only work when both are present.
Similar pitfalls apply to any combination that makes one optional and the other
mandatory: you can't cover all 4 possibilities (neither, just the first, just
the second, both) with one regex of this shape (== one match/search test).
So your code was already optimal.
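For comparison, a minimal sketch of the two-test approach run over the four sample names. The classify() helper and the labels it returns are hypothetical stand-ins for calling the real stuff*() functions:

```python
import re

meas_re = re.compile(r'_M\d+_')

def classify(filename):
    """Return the name of the stuff*() branch this filename would take."""
    if meas_re.search(filename):
        return 'stuff1a' if 'MeasDisplay' in filename else 'stuff1'
    # 'PatternFov' is spelled here as it appears in the sample names.
    return 'stuff2a' if 'PatternFov' in filename else 'stuff2'

cases = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]
for case in cases:
    print(case, '->', classify(case))
```

Each name costs one regexp search plus one plain substring test, and all four cases come out distinct.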
I am surprised that your program took twice as long to run with your doubled
test though. These are filenames, yes? So shouldn't the stuff*() functions be
opening the file or something? I would expect that to dominate the runtime, and
your extra name testing not to be the slowdown.
What's going on inside the stuff*() functions? Might they also have become more
complex with your new cases?
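One way to check that guess is to time the name tests on their own with the standard timeit module. This is only a sketch: name_tests() is a hypothetical stand-in that does the same checks as your code but calls no stuff*() function, and the loop count is arbitrary:

```python
import re
import timeit

meas_re = re.compile(r'_M\d+_')
names = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]

def name_tests():
    """Run only the filename checks and count the names hitting an 'a' branch."""
    hits = 0
    for name in names:
        if meas_re.search(name):
            hits += 'MeasDisplay' in name
        else:
            hits += 'PatternFov' in name
    return hits

loops = 10000   # arbitrary repeat count
seconds = timeit.timeit(name_tests, number=loops)
per_name = seconds / (loops * len(names))
print("about %.2f microseconds per filename" % (per_name * 1e6))
```

If that figure is tiny compared with the per-file time of a real run, the slowdown is coming from inside the stuff*() functions and not from the extra name test.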
Cheers,
Cameron Simpson <cs at cskk.id.au>