doubling the number of tests, but not taking twice as long

Cameron Simpson cs at cskk.id.au
Wed Jul 18 19:05:40 EDT 2018


On 18Jul2018 17:40, Larry Martell <larry.martell at gmail.com> wrote:
>On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti <neilc at norwich.edu> wrote:
>> On 2018-07-16, Larry Martell <larry.martell at gmail.com> wrote:
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?

As others have mentioned, your stuff*() functions must be doing very little
work, because I'd expect the regexp stuff to be fairly quick.

>Yeah, that was my first thought, but I haven't been able to come up
>with a regex that works.
>
>There are 4 cases I need to detect:
>
>case1 = 'spam_M123_eggs_MeasDisplay_sausage'
>case2 = 'spam_M123_eggs_sausage_and_spam'
>case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
>case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
>I thought this regex would work:
>
>'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'

Did you try making that a raw string:

  r'(......}'

to avoid mangling the backslashes (which Python will interpret before they get 
to the regexp parser)?

Print meas_regex to check it got past Python intact. Just print(meas_regex).
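
A minimal sketch of the kind of mangling I mean. It uses \b rather than \d,
because \b is an escape Python itself understands (a backspace), so the damage
is visible; \d is not a Python escape, so it happens to survive in a normal
string, though recent Python versions warn about it. The word "spam" and the
sample text are made up for the illustration:

```python
import re

# In a normal string literal Python turns \b into a single backspace
# character before the regexp parser ever sees it.
plain = '\bspam\b'
# In a raw string the backslash and the "b" both survive, so the regexp
# parser sees \b and treats it as a word boundary.
raw = r'\bspam\b'

print(len(plain), len(raw))
print(repr(plain))
print(repr(raw))

text = 'eggs spam eggs'
plain_match = re.search(plain, text)   # looks for literal backspaces
raw_match = re.search(raw, text)       # looks for the whole word "spam"
print("plain:", plain_match)
print("raw:", raw_match)
```

Printing the repr() as well as the string itself makes any control characters
obvious.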

Also, "{0,1}" is usually written "?".

>And then I could look at the match objects and see which of the 4
>cases it was. But try as I might, I could not get it to work. Any
>regex gurus want to tell me what I am doing wrong here?

Backslashes aside, it looks ok to me. So I'd better run it... Code:

    from __future__ import print_function
    import re

    case1 = 'spam_M123_eggs_MeasDisplay_sausage'
    case2 = 'spam_M123_eggs_sausage_and_spam'
    case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
    case4 = 'spam_spam_spam_eggs_sausage_and_spam'

    meas_regex = r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
    print("meas_regex =", meas_regex)

    meas_re = re.compile(meas_regex)

    for case in case1, case2, case3, case4:
      print(case, end=" ")
      m = meas_re.search(case)
      if m:
        print("MATCH: group1 =", m.group(1), "group2 =", m.group(2))
      else:
        print("NO MATCH")

Output:

    meas_regex = (_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}
    spam_M123_eggs_MeasDisplay_sausage MATCH: group1 = None group2 = None
    spam_M123_eggs_sausage_and_spam MATCH: group1 = None group2 = None
    spam_spam_spam_PatternFov_eggs_sausage_and_spam MATCH: group1 = None group2 = None
    spam_spam_spam_eggs_sausage_and_spam MATCH: group1 = None group2 = None

Ah, and there's the problem. Though I'm surprised to get the Nones in the 
.group()s instead of the empty string; possibly that reflects "0 occurrences".
[...] A little testing with other tweaks to the regexp supports that. No 
matter. To your problem:

When you write "(_M\d+_){0,1}" or anything that is optional like that, it can 
match the empty string (the "0"). And that _always_ matches.

Likewise the second part of the pattern.
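
A small sketch of that behaviour, using the same pattern with "?" in place of
"{0,1}". The span shows where the match landed and how long it was:

```python
import re

# Same pattern as above, with "?" in place of "{0,1}".
meas_re = re.compile(r'(_M\d+_)?.*?(MeasDisplay|PatternFOV)?')

m = meas_re.search('spam_M123_eggs_MeasDisplay_sausage')
# The search starts at offset 0, where each optional group is allowed to
# match zero times and the lazy ".*?" is allowed to match nothing, so the
# whole pattern succeeds there without consuming any text.
print("span =", m.span())
print("matched text =", repr(m.group(0)))
print("groups =", m.groups())
```

The search succeeds at the very start of the string with a zero-length match,
so it never goes looking for "_M123_" or "MeasDisplay" further along.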

Because you want to know about _both_ the "_M\d+_" _and_ the
"MeasDisplay|PatternFOV" you can't put them both in the same pattern: if you 
make them optional, the pattern always matches the empty string even if the 
target is later on; if you make them mandatory (no "{0,1}") your pattern will 
only work when both are present.

Similar pitfalls apply for any combination, making one optional and the other 
mandatory: you can't do all 4 possibilities (neither, just the first, just the
second, both) with one regex (== one match/search test).

So your code was already optimal.
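
For reference, a sketch of your two-step test wrapped up as one function. The
name classify() and the returned labels are hypothetical; they just stand in
for calling the matching stuff*() function:

```python
import re

# Compile once, outside the function, so each call only pays for the search.
meas_re = re.compile(r'_M\d+_')

def classify(filename):
    """Return a label naming which of the 4 cases a filename falls into."""
    if meas_re.search(filename):
        if 'MeasDisplay' in filename:
            return 'stuff1a'
        return 'stuff1'
    if 'PatternFov' in filename:
        return 'stuff2a'
    return 'stuff2'

cases = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]
for case in cases:
    print(case, '->', classify(case))
```

Each filename gets exactly one regexp search and one substring test, which is
the same work as your if/else version.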

I am surprised that your program took twice as long to run with your doubled
test though. These are filenames, yes? So shouldn't the stuff*() functions be
opening the file or something? I would expect that to dominate the runtime,
and your extra name testing not to be the slowdown.

What's going on inside the stuff*() functions? Might they also have become more 
complex with your new cases?
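
If you want to see how much the name tests themselves cost, here is a rough
sketch using timeit. The function names one_test() and two_tests() and the
sample filename are made up for the measurement, and the numbers will vary
from machine to machine:

```python
import re
import timeit

meas_re = re.compile(r'_M\d+_')
filename = 'spam_M123_eggs_MeasDisplay_sausage'

def one_test():
    # The original check: just the regexp search.
    return bool(meas_re.search(filename))

def two_tests():
    # The newer check: the regexp search plus one substring test.
    if meas_re.search(filename):
        return 'MeasDisplay' in filename
    return 'PatternFov' in filename

n = 100000
t1 = timeit.timeit(one_test, number=n)
t2 = timeit.timeit(two_tests, number=n)
print("regexp only:        %.3f usec per name" % (t1 / n * 1e6))
print("regexp + substring: %.3f usec per name" % (t2 / n * 1e6))
```

Compare those per-name figures with the time one of the stuff*() calls takes;
that should show where the doubling really comes from.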

Cheers,
Cameron Simpson <cs at cskk.id.au>


