Re: [Python-ideas] Find first tuple argument for str.find and str.index
Hi Ron
"Ron" == Ron Adam <rrr@ronadam.com> writes: Ron> I was thinking of something a bit more light weight.
Ah, now we got to what you actually want to do :-) Ron> For more complex stuff I think the 're' module already does pretty Ron> much what you are describing. It may even already take advantage of Ron> the algorithms you referred to. If not, that would be an important Ron> improvement to the re module. :-) Yes, that would make a good SoC project. But as you say it may already be done that way. Ron> The use case I had in mind was to find starting and ending delimiters. Ron> And to avoid the following type of awkward code. (This would work for Ron> finding other things as well of course.) start = 0 while start < len(s): i1 = s.find('{', start) if i1 == -1: i1 = len(s) i2 = s.find('}', start) if i2 == -1: i2 = len(s) # etc... for as many search terms as you have... # or use a loop to locate each one. start = min(i1, i2) if start == len(s): break ... # do something with s[start] ... Ron> That works but it has to go through the string once for each item. It's worse than that. _Each time around the loop_ it tests all candidates against all of the remaining text. Imagine matching L patterns that were each 'a' * M against a text of 'a' * N. You're going to do (roughly) O(N * L * M) comparisons, using naive string matching. You could completely drop patterns that have already returned -1. You can short-circuit a loop if you ever got a 0 index back. Sorry if this seems picky - I guess you're just writing quick pseudo-code. Ron> Of course I would use 're' for anything more complex than a few fixed Ron> length terms. Yes. It depends partly on what you want back. You could write a super-fast iterator based on A&C that told you everything you need to know, with guaranteed linear worst case behavior, but that seems like overkill here (and it may be in re, as you say). Ron> If the function returns something other than a simple index, then I Ron> think it will need to be a new function or method and not just an Ron> alteration of str.index and str.find. In that case it may also need a Ron> PEP. Be my guest :-) Regards, Terry
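(For illustration only, not code from the thread: a minimal sketch of the two
improvements described above. Terms that have already returned -1 are dropped
from later rounds, and a round stops early when a term matches right at the
current start, since no other term can beat that position.)

    def iter_matches(s, terms):
        """Yield (index, term) for each place any term occurs, left to right."""
        live = set(terms)                 # terms that may still occur later
        start = 0
        while live and start < len(s):
            best_i, best_term = len(s), None
            for term in sorted(live):     # snapshot, so discard below is safe
                i = s.find(term, start)
                if i == -1:
                    live.discard(term)    # never search for this term again
                    continue
                if i < best_i:
                    best_i, best_term = i, term
                if i == start:            # nothing can match earlier than start
                    break
            if best_term is None:
                break
            yield best_i, best_term
            start = best_i + 1            # move past this hit

    # list(iter_matches('a{b}c{d}', ['{', '}']))
    # -> [(1, '{'), (3, '}'), (5, '{'), (7, '}')]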
Terry Jones wrote:
It's worse than that. _Each time around the loop_ it tests all candidates against all of the remaining text. Imagine matching L patterns that were each 'a' * M against a text of 'a' * N. You're going to do (roughly) O(N * L * M) comparisons, using naive string matching.
Yep, it's bad, which is why I don't want to do it this way. Of course it's
much better than...

    for char in string:
        ...
        # etc ...
You could completely drop patterns that have already returned -1. You can short-circuit a loop if you ever got a 0 index back. Sorry if this seems picky - I guess you're just writing quick pseudo-code.
Well, it's a bit of both pseudo-code and from an actual problem I was
working on. It was a first unoptimized version. First rule is to get
something that works, then make it fast after it's tested. Right? :-)

No, you aren't being any more picky than I am. I didn't like that solution
either, which is why I suggested improving index and find just a bit. And
I'm not real keen on this next one, although it will probably work better.
(not tested)

    length = len(s)
    cases = dict((c, 0) for c in terms)
    start = 0
    while 1:
        for term, i in cases.items():
            if i == start:
                i = s.find(term, start)
                cases[term] = i if i > -1 else length
        start = min(cases.values())
        if start == length:
            break
        ...  # Do something with s[start].
        # Return some result we constructed or found along the way.

That's an improvement, but it's still way more work than I think the
problem should need. It's also complex enough that it's no longer obvious
just what it's doing.

I still like the tuple version of index or find better. Much easier to
read and understand.

    ...
    try:
        start = s.index(('{', '}'), start)
    except ValueError:
        break
    ...

    ...
    start = s.find(('{', '}'), start)
    if start == -1:
        break
    ...
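(Again for illustration only: str.find does not accept a tuple today, but the
proposed behaviour is easy to mock up as a small helper, which shows roughly
what the loops above would collapse to. The name find_any is made up here.)

    def find_any(s, terms, start=0):
        """Lowest index >= start where any term occurs, or -1 if none do."""
        hits = [i for i in (s.find(t, start) for t in terms) if i != -1]
        return min(hits) if hits else -1

    # start = 0
    # while True:
    #     start = find_any(s, ('{', '}'), start)
    #     if start == -1:
    #         break
    #     ...              # do something with s[start]
    #     start += 1       # advance, or the same index is found again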
Ron> Of course I would use 're' for anything more complex than a few fixed Ron> length terms.
Yes. It depends partly on what you want back. You could write a super-fast iterator based on A&C that told you everything you need to know, with guaranteed linear worst case behavior, but that seems like overkill here (and it may be in re, as you say).
Regular expressions can be slower for small things because they have more
overhead, so they're not always the fastest choice. And I've read they are
not good at parsing nested delimiters.

Cheers,
Ron
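(One more illustrative sketch, not from the thread: the multi-term scan can
also be done in a single pass with re by joining the escaped literal terms
into one alternation, which is the kind of use a tuple-accepting find would
be compared against.)

    import re

    def re_iter_matches(s, terms):
        """Yield (index, term) for each place any literal term occurs."""
        # Longest terms first so that e.g. '{{' wins over '{' at the same spot.
        pattern = '|'.join(re.escape(t)
                           for t in sorted(terms, key=len, reverse=True))
        for m in re.finditer(pattern, s):
            yield m.start(), m.group()

    # list(re_iter_matches('a{b}c{d}', ['{', '}']))
    # -> [(1, '{'), (3, '}'), (5, '{'), (7, '}')]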
participants (2)
- Ron Adam
- Terry Jones