Changing Python's string search algorithms
Fredrik Lundh crafted our current string search algorithms, and they've served us very well. They're nearly always as fast as dumbest-possible brute force search, and sometimes much faster. This was bought with some very cheap one-pass preprocessing of the pattern (the substring to search _for_), to compute a single integer "skip" and a single 32-bit bitmask, which can sometimes be used to skip over areas where it can prove matches are doomed to fail.

Unlike most "fancier than brute" algorithms, it does not build preprocessing tables of size proportional to the length of the pattern, or to the size of the alphabet. The extra space is O(1), and tiny, independent of pattern and alphabet size.

A weakness is that worst-case search time is proportional to the product of the lengths of the inputs. This can be so for searches that succeed as well as for those that fail. This was recently dramatically illustrated by this bug report:

https://bugs.python.org/issue41972

where a single large string search consumed well over 3 hours of CPU time before it succeeded. It "should take" a fraction of a second.

While it was news to me, Dennis Sweeney was aware that other algorithms with worst-case linear time, still requiring only O(1) extra space, are now well known. That's really quite remarkable :-) They do require fancier (but still linear-time) preprocessing, so can't always be a pure win.

So this message: string (which includes byte) search is an extremely common operation, across all kinds of applications, so changes here need as many eyeballs on them as they can get. What's typical? What's important? What doesn't matter? What about reverse searches ("rfind")? How can we quantify the acceptable region of tradeoffs?

My guess is that searches for substrings less than 50 characters long in strings under ten thousand characters are most important for most apps, but I just pulled that guess out of my butt. I'm also guessing that searches for multi-million-character substrings (like in the bug report above) are rare indeed, and that any change sparing them from quadratic-time disaster would be "way more than good enough". But I don't know.

If you can help, Dennis already has a preliminary PR implementing one of the newer algorithms, augmented with Fredrik's unique ideas:

https://github.com/python/cpython/pull/22679

By "help" I don't necessarily mean code review - it could, e.g., be a huge help to test-drive it on your own critical string-slinging apps. Faster? Slower? A wash? Wrong answer? Crash?

Due to the natures of the old and new algorithms, neither will be faster or slower in all cases. The newer one should never be dramatically slower than the old one, the new one will be spectacularly faster in some cases, and "in general" it seems impossible to guess in advance. It depends on the value the fancier preprocessing can extract from the pattern versus the higher preprocessing cost, and that in turn depends on details of exactly what the patterns and texts to be searched are.
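[Editorial note: for concreteness, here is a hedged, scaled-down sketch of the kind of self-similar search that hits the quadratic worst case. It is not the bug report's actual data, and the sizes are deliberately small; the point is only that the old algorithm's time grows roughly with the product of the lengths on inputs shaped like this, while a worst-case-linear algorithm grows roughly with their sum.]

```python
# A scaled-down, self-similar worst case with the same general shape as the
# pathological search described above (the bug report's actual data is much
# larger and different in detail - this is only an illustration).
import time

haystack = "X" * 100_000
needle = "X" * 10_000 + "Y" + "X" * 10_000   # never matches, but almost does everywhere

start = time.perf_counter()
print(haystack.find(needle))                  # -1
print(f"{time.perf_counter() - start:.3f} seconds")

# Under the existing algorithm the time on inputs like this grows roughly
# with len(haystack) * len(needle); with a worst-case-linear algorithm it
# grows roughly with len(haystack) + len(needle).  Double both sizes and
# compare for yourself.
```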
On 10/13/20 9:54 AM, Tim Peters wrote:
Due to the natures of the old and new algorithms, neither will be faster or slower in all cases, the newer one should never be dramatically slower than the old one, the new one will be spectacularly faster in some cases, and "in general" it seems impossible to guess in advance. It depends on the value the fancier preprocessing can extract from the pattern versus the higher preprocessing cost, and that in turn depends on details of exactly what the patterns and texts to be searched are.
My off-the-top-of-my-head reaction: I've always assumed that most string searching is tiny. Like, searching for a < 10 character string in a < 256 character string. That's what I'm always doing, at least. So while I'd obviously like to see us address the pathological cases cited--and thanks to Dennis Sweeney for the PRs!--I'd hope that it doesn't make these small searches any slower.

Your email didn't give me a sense of how much additional preprocessing these new algorithms require; the fact that they're O(1) space suggests they aren't bad. But if you do lots and lots of dinky searches surely it would add up.

Looking at the PR, I see there's a special case for searching for a single character, and then cases for forward-search and reverse-search. I was wondering if I'd find a heuristic like "if the string you're searching in is < 2k, use the old search algorithm". Is that worth pursuing? It doesn't seem like the maintenance cost would be that high; I don't think anybody has touched that code in years. (No surprise, the Effbot did a good job.)

//arry/
I'm sure that the large majority of the string searches I've done are in Larry's tiny category. However, I also think that big data needs are increasing, and things like FASTA files can be enormously large texts that one has good reasons to search on. If there is a heuristic switch between algorithms, it seems like it should threshold on needle, not on haystack. Or possibly some more complex function of both. But if I understand the two-way approach correctly (which I probably don't), there's not really much gain for small needle.
Rest assured that Dennis is aware that pragmatics may change for shorter needles.

The code has always made a special case of 1-character needles, because it's impossible "even in theory" to improve over straightforward brute force search then.

Say the length of the text to search is `t`, and the length of the pattern `p`. Brute force and the current code have worst case O(t*p) behavior. The new code, worst case O(t+p). If that's as deep as you dig into it, it seems all but obvious then that O(t*p) just can't be that bad when p is small, so keep it simple.

But there's another twist: the current and new code both have O(t/p) cases too (but brute force, and even Knuth-Morris-Pratt, don't). That can be highly beneficial even for p as small as 3. Unfortunately, the exact cases in which the current and new code enjoy O(t/p) behavior aren't the same.

Lying a bit: in general the current code has just two tricks. One of those tricks is useless (pure waste of time) if the pattern happens to end with a repeated character pair, and is most effective if the last character appears nowhere else in the pattern. The other trick is most effective if the set of characters in the text has no intersection with the set of characters in the pattern (note that the search is certain to fail then), but useless if the set of text characters is a subset of the set of pattern characters (as, e.g., it very often is in real-life searches in apps with a small alphabet, like [ACGT]+ for genomes).

But I don't know how to characterize when the new code gets maximum benefit. It's not based on intuitive tricks. The paper that introduced it[1] says it's based on "a deep theorem on words known as the Critical Factorization Theorem due to Cesari, Duval, Vincent, and Lothaire", and I still don't fully understand why it works.

It's clear enough, though, that the new code gets real systematic value out of patterns with repetitions (like "abab"), where the current code gets real value from those only by accident (e.g., "a" and "b" happen to appear rarely in the text being searched, in which case that the pattern has repetitions is irrelevant).

But, as I said in the first message, the PR is "preliminary". There are still worlds of tweaks that have been thought of but not yet tried even once.

[1] Search for "Two-Way String Matching" by Crochemore and Perrin.
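[Editorial note: a hedged toy model of the second trick described above, using a plain Python set where CPython's real C code uses its tiny 32-bit bitmask and none of the other machinery. It shows the two regimes: roughly t/p work when the text is full of characters the pattern can't contain, and essentially brute force when the text's characters are a subset of the pattern's, as in the [ACGT]+ case.]

```python
# Toy model of the "is this text character in the pattern at all?" trick.
# Illustration only - not CPython's implementation.
import random

def toy_find(haystack, needle):
    """Return (index_or_minus_1, number_of_character_comparisons)."""
    n, m = len(haystack), len(needle)
    in_needle = set(needle)
    compares = 0
    i = 0
    while i <= n - m:
        j = 0
        while j < m:
            compares += 1
            if haystack[i + j] != needle[j]:
                break
            j += 1
        if j == m:
            return i, compares
        # The trick: peek at the character just past the current window.
        # If it occurs nowhere in the needle, no window containing it can
        # match, so jump the whole needle past it.
        if i + m < n and haystack[i + m] not in in_needle:
            i += m + 1
        else:
            i += 1
    return -1, compares

random.seed(0)
english_like = "".join(random.choice("etaoin shrdlu") for _ in range(100_000))
dna_like = "".join(random.choice("ACGT") for _ in range(100_000))

# Needle characters never appear in the text: roughly t/(p+1) comparisons.
print(toy_find(english_like, "qzxjvwkb"))
# Every text character appears in the needle: the peek never helps, and the
# comparison count is back in brute-force territory (roughly t).
print(toy_find(dna_like, "ACGT" * 4))
```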
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO...
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
[Guido]
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO..
Yes, the Wikipedia page is worse than useless in its current state, although some of the references it lists are helpful. This is a much better summary:

http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260

but, I believe, still far too telegraphic. The original paper is by far the best account I've seen so far, with complete code and exhaustive explanations and proofs. Even examples ;-)

But here's the thing: I don't believe this algorithm _can_ be reduced to an elevator pitch. Excruciating details appear to be fundamental at every step, and I've seen nothing yet anywhere approaching an "intuitive explanation" :-( What happens instead is that you run it on the hardest cases you can dream up, and your jaw drops in amazement :-)
That sounds very interesting! I'll definitely be digging deeper into the algorithm, papers and proofs of the underlying "Critical Factorization" theorem. My goal would be to find a good way to explain them, ideally some kind of interactive article. Ideally that would also lead to better Wikipedia pages. I'd be happy to help with this project. I've done quite a bit of relevant work while developing my fuzzysearch[1] library. - Tal Einat [1] https://pypi.org/project/fuzzysearch/
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method? Leave it up to the user to decide whether or not their data best suits the find method or the new search function. It sounds like we can offer some rough heuristics, but the only way to really know is "try it and see which works best for you". The `string` module is an obvious place for it. -- Steve
On Wed, Oct 14, 2020 at 7:45 PM Steven D'Aprano <steve@pearwood.info> wrote:
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method?
That feels unworkable to me. For one thing, the 'in' operator hits this same issue, doesn't it? But for another, in the example initially posted, whether the current code did or didn't hit that quadratic breakdown came down to just one extra character in a long string. I cannot imagine end users knowing in advance whether their exact search will hit that unlucky circumstance, which is very data dependent.

If it were a case of "in certain circumstances, this other function might be twice as fast" I can see that being useful. But here we get a big-O explosion that went from milliseconds to hours on ostensibly similar operations.

That said, I do recognize that the `re` module also has pathological cases, and the standard library documentation explicitly says "maybe you want to try the third-party `regex` implementation." That is sort of precedent for this approach.

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
[Steven D'Aprano <steve@pearwood.info>]
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method?
Leave it up to the user to decide whether or not their data best suits the find method or the new search function. It sounds like we can offer some rough heuristics, but the only way to really know is "try it and see which works best for you".
The `string` module is an obvious place for it.
I think this is premature. There is almost never an optimization that's a pure win in all cases. For example, on some platforms `timsort` will never be as fast as the old samplesort in cases with a very large number of equal elements, and on all platforms `timsort` consumes more memory. And it's a wash on "random" data on most platforms. Nevertheless, it was a significant speed win for many - possibly even most - real-life cases.

So far, the PR reduces the runtime in the bug report from about 3 1/2 hours to under a tenth of a second. It would be crazy not to gift users such dramatic relief by default, unless there are good reasons not to. Too soon to say. On tests of random strings with characters following a Zipf distribution (more realistic than uniform but still very easy to generate - and not contrived in any way to favor either approach), the current new code is usually faster than the status quo, in no case has been worse than twice as slow, and in a number of cases has been 20x faster. It's already clearly a win _overall_.

The bulk of the "but it's slower" cases now are found with very short needles (patterns), which was expected (read my comments on the bug report), and exacerbated by the fact that the structure of the random test generation is quite likely to create cases where a short needle is found early in the haystack. This combination makes the higher preprocessing overhead as damaging as possible. Dennis also expanded the current code's "32-bit bitmask" trick (which is an extreme simplification of Daniel Sunday's algorithm) to a fixed-size 256-byte table of character-class-specific skip counts, which went a _very_ long way toward eliminating the cases where the current code enjoyed a "pure luck" advantage over the new code. But that increased the preprocessing overhead again, which again is relatively most significant for short needles - those 256 bytes have to be initialized every time, no matter the lengths of the needle or haystack.

If the new code were changed to exempt short needles, even now it might be hard to make an objective case not to adopt it.

But it would leave open a different idea for an "opt-in" facility: one that allowed giving a needle to a function and getting back an object capturing the results of preprocessing, plus a different wrapper around the search code that accepted the already-preprocessed info. There are, e.g., certainly cases where repeated searches for a given 4-character needle could be sped significantly by exploiting the new code, but where the overhead of preprocessing is too much to bear in "random" performance testing. It would also open the door to more aggressive (expensive) kinds of preprocessing. That's where users may be able to make useful, informed judgments.

[David Mertz]
That said, I do recognize that the `re` module also has pathological cases, and the standard library documentation explicitly says "maybe you want to try the third-party `regex` implementation." That is sort of precedent for this approach.
`regex` also suffers exponential time disasters; they're just harder to stumble into - and there are books explaining in detail how and why regex engines can fall into these traps. It's far less _expected_ in plain string searches, and Dennis was aware of the new algorithms because, apparently, (at least) glibc's memmem switched to one some years ago. So we're playing catch-up here.
On Thu, Oct 15, 2020 at 11:38 AM Tim Peters <tim.peters@gmail.com> wrote:
I think this is premature. There is almost never an optimization that's a pure win in all cases. For example, on some platforms `timsort` will never be as fast as the old samplesort in cases with a very large number of equal elements, and on all platforms `timsort` consumes more memory. And it's a wash on "random" data on most platforms. Nevertheless, it was a significant speed win for many - possibly even most - real-life cases.
And timsort is one of my go-tos for teaching the concept of hybrid sorting algorithms, because, at its heart, it's simple enough to explain, and it manages to be so incredibly valuable in real-world code. :)
But it would leave open a different idea for an "opt-in" facility: one that allowed to give a needle to a function and get back an object capturing the results of preprocessing. Then a different wrapper around the search code that accepted the already-pre-processed info. There are, e.g., certainly cases where repeated searches for a given 4-character needle could be sped significantly by exploiting the new code, but where the overhead of preprocessing is too much to bear in "random" performance testing. It would also open the door to more aggressive (expensive) kinds of preprocessing. That's where users may be able to make useful, informed judgments.
Kinda like the way a compiled regex is used? I like this idea. So it'd be heuristics in the core language that choose a good default for most situations, and then a str method that returns a preprocessed needle. In a Python implementation that doesn't want to use two different algorithms, that preprocessor could return the string unchanged, but otherwise it's an opaque object usable only in string searches. +1, if my interpretation of your description is correct. ChrisA
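[Editorial note: to picture the opt-in idea being discussed, here is a purely hypothetical sketch. Nothing like this exists in CPython or is being proposed verbatim - the class name, its methods, and its behavior are made up for illustration, with plain str.find standing in for a search routine that would reuse already-preprocessed needle data.]

```python
# Hypothetical sketch only - illustrating the "preprocess the needle once,
# search many haystacks" shape, not an existing or proposed API.

class CompiledNeedle:
    def __init__(self, needle: str):
        self.needle = needle
        # A real implementation would compute and store the expensive
        # preprocessing here (factorization, skip table, ...), once.

    def find_in(self, haystack: str, start: int = 0) -> int:
        # Stand-in: str.find does its own preprocessing on every call.
        return haystack.find(self.needle, start)

    def finditer(self, haystack: str):
        # Yields all (possibly overlapping) match positions.
        i = self.find_in(haystack)
        while i != -1:
            yield i
            i = self.find_in(haystack, i + 1)

# Pay the (hypothetically expensive) preprocessing once...
needle = CompiledNeedle("ACGTACGTAC")
# ...then reuse it across many searches.
for chunk in ["ACGT" * 6, "TTTTACGTACGTACTTTT"]:
    print(list(needle.finditer(chunk)))      # [0, 4, 8, 12] and [4]
```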
On Wed, Oct 14, 2020 at 9:56 AM Tim Peters <tim.peters@gmail.com> wrote:
[Guido]
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO..
Yes, the Wikipedia page is worse than useless in its current state, although some of the references it lists are helpful. This is a much better summary:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
but, I believe, still far too telegraphic.
The key seems to be: """ The searching phase of the Two Way algorithm consists in first comparing the character of xr from left to right, then the character of x from right to left. When a mismatch occurs when scanning the k-th character of xr, then a shift of length k is performed. When a mismatch occurs when scanning x, or when an occurrence of the pattern is found, then a shift of length per(x) is performed. Such a scheme leads to a quadratic worst case algorithm, this can be avoided by a prefix memorization: when a shift of length per(x) is performed the length of the matching prefix of the pattern at the beginning of the window (namely m-per(x)) after the shift is memorized to avoid to scan it again during the next attempt. """ The preprocessing comes down to splitting the original search string ("needle") in two parts, xl and xr, using some clever algorithm (IIUC the wikipedia page does describe this, although my brain is currently too addled to visualize it).
The original paper is by far the best account I've seen so far, with complete code and exhaustive explanations and proofs. Even examples ;-)
I need a Python version though.
But here's the thing: I don't believe this algorithm _can_ be reduced to an elevator pitch. Excruciating details appear to be fundamental at every step, and I've seen nothing yet anywhere approaching an "intuitive explanation" :-( What happens instead is that you run it on the hardest cases you can dream up, and your jaw drops in amazement :-)
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses. What are typical hard cases used for? DNA search? (That would be cool!) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Wed., Oct. 14, 2020, 17:37 Tim Peters, <tim.peters@gmail.com> wrote:
If the new code were changed to exempt short needles, even now it might be hard to make an objective case not to adopt it.
This is where it sounds like we might want to go. If we can figure out a reasonable, cheap heuristic to define "too short" -- for either the search string or the string to search -- for deciding when to use the fancier algorithm, then I would support adding the fancy string search. -Brett
[Guido]
The key seems to be:
Except none of that quoted text (which I'll skip repeating) gives the slightest clue as to _why_ it may be an improvement. So you split the needle into two pieces. So what? What's the _point_? Why would someone even imagine that might help? Why is one half then searched right to left, but the other half left to right? There's actually "a reason" for searching the right half left to right, but because the shift on a mismatch in the left half is a constant ("per(x)") independent of the mismatching position, it's actually irrelevant in which order the left half's characters are compared (well, at least in some variations of these newer algorithms). Over-specification can be actively misleading too. What's the key idea? "Split the needle into two pieces" is a key _step_, not a key _insight_.
I need a Python version though.
Dennis wrote one for his first stab, which you can download from the bug report (it's attached there as fastsearch.py). On the bug report's data, running that under CPython 3.9.0 on my box reduced the time from about 3 1/2 hours to about 20 seconds. Running it under PyPy, to under one second (and the current C implementation under 0.1 second).
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses.
They probably do, but not dramatically. Would you, e.g., notice a factor of 2 either way in those cases? A factor of 4? Of 10? When Fredrik first wrote this code, he routinely saw factors of 2 to 10 speedup on all sorts of string-searching benchmarks, both contrived and "natural". The speedups were especially pronounced for Unicode strings at the time, where skipping futile Unicode character comparisons could be more valuable than when skipping (shorter) single-byte comparisons. Should also note that the fixed string searching code is also used as a subroutine by parts of our regexp implementation, by str.count(), str.replace(), similar `bytes` operations, and so on.
What are typical hard cases used for?
It's kinda like asking what typical hard rounding cases for pow() are used for ;-) They aren't deliberate. They "just happen", typically when a pattern and the text to search contain self-similarities "but with glitches".

Search the string

    text = "X" * 10_000_000

for

    needle = "X" * 1_000_000

Easy! It matches at once. But now tack "Y" on to the end of the needle. Then it's impossible to match. Brute force search first finds a prefix match at text[: 1000000], but then fails to match the trailing "Y". So brute force uses another million compares to match the prefix at text[1 : 1000001]. But again fails to match the trailing "Y". Then another million plus 1 compares to fail to match starting at index 2, and again at index 3, and again at index 4, ... it approaches a trillion comparisons before it finally fails.

The current code starts comparing at the right end of each trial position first. Then an "X" from the text and the needle's trailing "Y" mismatch at once. That's a million times faster.

Here's a simple case where the current code is horribly slow, because neither "right-to-left" nor any of its preprocessing tricks do any good at all. Even Daniel Sunday's full-blown algorithm[1] (which the current code strips down _almost_ to the point of non-existence) wouldn't help (unless it used a variant I've never actually seen that compared the rarest pattern character first):

    ("X" * 10_000_000).find("X" * 500_000 + "Y" + "X" * 500_000)

The newer code returns -1 seemingly instantly.
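[Editorial note: for anyone who wants to see those counts rather than take them on faith, here is a small illustrative model - pure Python with a comparison counter, not CPython's actual implementation - run at a much smaller scale. The "Y at the end" needle is rescued by comparing right-to-left; the "Y in the middle" needle defeats both comparison orders.]

```python
# Toy models of "compare left to right" vs "compare right to left at each
# trial position", counting character comparisons.  Illustration only - the
# real code layers skip tricks on top of this.

def count_compares(haystack, needle, right_to_left=False):
    n, m = len(haystack), len(needle)
    order = range(m - 1, -1, -1) if right_to_left else range(m)
    compares = 0
    for start in range(n - m + 1):
        for j in order:
            compares += 1
            if haystack[start + j] != needle[j]:
                break
        else:
            return compares          # full match found; stop like find() would
    return compares                  # no match anywhere

text = "X" * 10_000
trailing_y = "X" * 1_000 + "Y"
mid_y = "X" * 500 + "Y" + "X" * 500

for name, needle in [("Y at the end", trailing_y), ("Y in the middle", mid_y)]:
    l2r = count_compares(text, needle)
    r2l = count_compares(text, needle, right_to_left=True)
    print(f"{name}: left-to-right {l2r:,} compares, right-to-left {r2l:,}")

# "Y at the end": ~9 million compares left-to-right, only ~9 thousand right-to-left.
# "Y in the middle": roughly 4.5 million compares either way - no comparison
# order helps, which is why this shape needs a different kind of fix.
```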
DNA search? (That would be cool!)
It can certainly help. Self-similarities are bound to be common in long strings from a 4-character alphabet (ACGT). But, to be fair, serious work with genomes typically invests in building a giant suffix tree (or enhanced suffix array) for each cataloged sequence to be searched. No preprocessing of needles is needed then to guarantee worst-case efficient searches of many kinds (including things like finding the longest adjacent repeated subsequences, more like a regexp (.+)\1 kind of search). But the new code could quite plausibly speed more casual prototyping of such explorations. [1] https://dl.acm.org/doi/abs/10.1145/79173.79184
Here's my attempt at some heuristic motivation: Try to construct a needle that will perform as poorly as possible when using the naive two-nested-for-loops algorithm. You'll find that if there isn't some sort of vague periodicity in your needle, then you won't ever get *that* unlucky; each particular alignment will fail early, and if it doesn't then some future alignment would be pigeonholed to fail early.

So Crochemore and Perrin's algorithm explicitly handles this "worst case" of periodic strings. Once we've identified in the haystack some period from the needle, there's no need to re-match it. We can keep a memory of how many periods we currently remember matching up, and never re-match them. This is what gives the O(n) behavior for periodic strings.

But wait! There are some bad needles that aren't quite periodic. For instance:

    >>> 'ABCABCAABCABC' in 'ABC'*1_000_000

The key insight though is that the worst strings are still "periodic enough", and if we have two different patterns going on, then we can intentionally split them apart. For example, `"xyxyxyxyabcabc" --> "xyxyxyxy" + "abcabc"`. I believe the goal is to line it up so that if the right half matches but not the left then we can be sure to skip somewhat far ahead. This might not correspond exactly with splitting up two patterns.

This is glossing over some details that I'm admittedly still a little hazy on as well, but hopefully that gives at least a nudge of intuition.
[Dennis Sweeney <sweeney.dennis650@gmail.com>]
Here's my attempt at some heuristic motivation:
Thanks, Dennis! It helps. One gloss:
.... The key insight though is that the worst strings are still "periodic enough", and if we have two different patterns going on, then we can intentionally split them apart.
The amazing (to me) thing is that splitting into JUST two parts is always enough to guarantee linearity. What if there are a million different patterns going on? Doesn't matter! I assume this remarkable outcome is a consequence of the Critical Factorization Theorem.
[Guido]
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses.
I should have responded to this part differently. What I said was fine ;-), but it's a mistake to focus exclusively on pathologies here. What Fredrik did was with the aim of significantly speeding utterly ordinary searches. For example, search for "xyz" in "abcdefgh...xyz". Brute force starts comparing at the first location:

abcdefgh...
xyz

The current code compares "c" with "z" first. They don't match. Now what? It looks at the _next_ character in the haystack, "d". Thanks to preprocessing the needle, it knows that "d" appears nowhere in the needle (actually, the code is so keen on "tiny constant extra space" that preprocessing only saves enough info to get a _probabilistic_ guess about whether "d" is in the needle, but one that's always correct when "not in the needle" is its guess). For that reason, there's no possible match when starting the attempt at _any_ position that leaves "d" overlapping with the needle. So it can immediately skip even trying starting at text indices 1, 2, or 3. It can jump immediately to try next at index 4:

abcdefgh...
0123xyz

Then the same kind of thing, seeing that "g" and "z" don't match, and that "h" isn't in the needle. And so on, jumping 4 positions at a time until finally hitting "xyz" at the end of the haystack.

The new code gets similar kinds of benefits in _many_ similarly ordinary searches too, but they're harder to explain because they're not based on intuitive tricks explicitly designed to speed ordinary cases. They're more happy consequences of making pathological cases impossible. From randomized tests so far, it's already clear that the new code is finding more of this nature to exploit in non-pathological cases than the current code. Although that's partly (but only partly) a consequence of Dennis augmenting the new algorithm with a more powerful version of the specific trick explained above (which is an extreme simplification of Daniel Sunday's algorithm, which was also aimed at speeding ordinary searches).
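[Editorial note: a tiny trace of that walkthrough - a toy that uses a Python set where the real code uses its probabilistic bitmask, and compares the whole window instead of right-to-left - shows the jump-by-4 pattern directly.]

```python
# Toy trace of the skip described above - illustration only.
haystack = "abcdefghijklmnopqrstuvwxyz"
needle = "xyz"
in_needle = set(needle)
m = len(needle)

i = 0
while i <= len(haystack) - m:
    print("trying index", i)
    if haystack[i:i + m] == needle:
        print("found at", i)
        break
    nxt = i + m
    if nxt < len(haystack) and haystack[nxt] not in in_needle:
        i = nxt + 1      # that character can't be part of any match
    else:
        i += 1

# Tries indices 0, 4, 8, 12, 16, 20 - jumping 4 at a time - then 21, 22, 23,
# where "xyz" is found (the jumps shrink near the end because "x", "y", "z"
# start appearing just past the window).
```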
Excuse me if I intrude in an algorithm that I have not understood, but can the new optimization be applied to regexps too?
[Marco Sulla]
Excuse me if I intrude in an algorithm that I have not understood, but can the new optimization be applied to regexps too?
The algorithm is limited to searching for fixed strings. However, _part_ of our regexp implementation (the bit that looks ahead for a fixed string) will inherit it by magic.
I don't plan on making a series of these posts, just this one, to give people _some_ insight into why the new algorithm gets systematic benefits the current algorithm can't. It splits the needle into two pieces, u and v, very carefully selected by subtle linear-time needle preprocessing (and it's quite a chore to prove that a suitable splitting always exists!):

needle == u + v

There are a whole bunch of things that can be said about this splitting (and must be said to prove worst-case linear time), but here's just one for which it's easy to show the benefit: no (non-empty) prefix of v is a suffix of u.

Now I think it's quite easy to twist your head into knots trying to see what follows from that, so I'll just give a highly generalizable concrete example. Suppose v starts with "xyza". Then u cannot end with "x", "xy", "xyz", or "xyza". If u did, then u would have a suffix that's also a prefix of v.

A match attempt at a given haystack position starts by matching the characters in v one at a time, left to right. Here with haystack on the first line, needle on the second, and "0" denoting "don't care / unknown / doesn't exist" - it's just _some_ glyph so the example has some chance of lining up under annoying proportional fonts :-(

Say the first character matches:

00000x000000
0uuuuxyzavvv

And the second and third:

00000xyz0000
0uuuuxyzavvv

But the 4th doesn't:

00000xyzZ000
0uuuuxyzavvv

Is there any possibility of a match if we shift the needle one to the right? No! If we tried, the last character of u would line up with the "x" in the haystack, and we already know "x" is not a suffix of u. How about if we shifted two to the right? No again! And for the same reason: then the last two characters of u would line up with "xy" in the haystack, but we already know "xy" is not a suffix of u. And so on.

In general, if we find a mismatch while trying the k'th (1-based) character of v, we can shift the needle k places to the right before there's any possibility of matching. In this case, k=4, so we can jump to:

00000xyzA0000000
00000uuuuxyzavvv

Note that no "extra" storage is needed to exploit this. No character lookups, no extra expenses in time or space of any kind. Just "if we mismatch on the k'th try, we can jump ahead k positions".

Nothing special about "xyza" either - this works with any characters at all! "xyza" isn't by any stretch a hard case regardless. But it goes _some_ way toward hinting at why potentially hard cases are rendered toothless too. For example, the needle

"x" * 100

is split into

u = ""  # yup, empty!
v = needle

Now, e.g., if we find a mismatch at the 83'rd character of v, we can leap ahead 83 positions immediately.

What about needle "x" * 100 + "y"? I picked that to crush your premature optimism ;-) Now the splitting is

u = "x" * 100
v = "y"

and the wonderful trick above only applies to a trivial 1-character string. The "hard part" is all in matching the u piece now, which I'll leave as an exercise for the reader ;-)
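[Editorial note: for readers who want to poke at the shift rule, here is a hedged, partial sketch. It implements only the v-side rule described above, asserts the two properties the explanation leans on (no nonempty prefix of v is a suffix of u, and len(u) >= len(v), the simplification mentioned later in the thread), and falls back to a shift of 1 whenever the mismatch happens in u - which is exactly the "hard part" the real algorithm handles and this sketch does not. The example split is chosen by hand; the real algorithm derives a suitable split from the needle itself.]

```python
# Partial sketch: ONLY the "mismatch at the k'th character of v => shift k"
# rule.  It is NOT the two-way algorithm: when the mismatch happens while
# matching u it just shifts by 1, which is precisely the part the real
# algorithm (via the critical factorization and per(x)) handles cleverly.

def find_with_v_rule(haystack, u, v):
    # Properties the shift rule relies on (see the explanation above).
    assert len(u) >= len(v) > 0
    assert not any(v[:i] == u[-i:] for i in range(1, len(v) + 1)), \
        "some nonempty prefix of v is a suffix of u"
    n, m, lu = len(haystack), len(u) + len(v), len(u)
    pos = 0
    while pos <= n - m:
        # Match the v part first, left to right.
        k = 0
        while k < len(v) and haystack[pos + lu + k] == v[k]:
            k += 1
        if k < len(v):
            pos += k + 1     # mismatch at the (k+1)'th char of v: shift k+1 is safe
            continue
        # v matched in full; now check the u part.
        if haystack[pos:pos + lu] == u:
            return pos
        pos += 1             # the "hard part": a smarter shift lives here
    return -1

# Hand-picked split satisfying the asserted property: "abracadabra" = "abraca" + "dabra".
print(find_with_v_rule("say the magic word: abracadabra!", u="abraca", v="dabra"))
# -> 20, the same answer str.find gives for "abracadabra".
```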
On Sat, Oct 17, 2020 at 12:30 PM Tim Peters <tim.peters@gmail.com> wrote:
I don't plan on making a series of these posts, just this one, to give people _some_ insight into why the new algorithm gets systematic benefits the current algorithm can't. It splits the needle into two pieces, u and v, very carefully selected by subtle linear-time needle preprocessing (and it's quite a chore to prove that a suitable splitting always exists!):
needle == u + v
[... more details...]
Thank you, great explanation. Can this be added to the source code if/when this algorithm gets implemented? I've learned all kinds of things from the huge block comments about timsort, list growth, and the like. They make great reading when you're trying to figure out how to explain things (or if you're a bored nerd browsing for curiosity's sake - that works too!). ChrisA
[Tim Peters, explains one of the new algorithm's surprisingly effective moving parts] [Chris Angelico <rosuav@gmail.com>]
Thank you, great explanation. Can this be added to the source code if/when this algorithm gets implemented?
No ;-) While I enjoy trying to make hard things clear(er), I need to understand them inside out myself first, and I'm still not there with this algorithm. Even of what I do grok now, I cheated some to make that post simple. For example, if you read it carefully, you'll realize that the explanation I gave doesn't actually explain why the "x" * 100 example is correct: the explanation implicitly relied on the left half of the splitting (u) being at least as long as the right half (v). But that's not always the case (and, in particular, for "x" * 100, u is empty!).

Tal Einat posted earlier that he was keen to try to work up a clear explanation, and I look forward to that. All the expositions I've found of this algorithm so far are hard to approach. The original paper is still the best I've seen. Everything I wrote was but a gloss on its Lemma 3.2.
I've learned all kinds of things from the huge block comments about timsort, list growth, and the like. They make great reading when you're trying to figure out how to explain things (or if you're a bored nerd browsing for curiosity's sake - that works too!).
Takes one to know one ;-) There's beauty and elegance in this algorithm, but they'll go unloved without explanations clear enough for the non-obsessed to follow.
On Fri, 16 Oct 2020 20:24:22 -0500 Tim Peters <tim.peters@gmail.com> wrote:
Note that no "extra" storage is needed to exploit this. No character lookups, no extra expenses in time or space of any kind. Just "if we mismatch on the k'th try, we can jump ahead k positions".
Ok, so that means that on a N-character haystack, it'll always do at least N character comparisons? IIRC, the current algorithm is able to do much less than that in favorable cases. If the needle is k characters long and they are all distinct, it's able to do N/k comparisons. Regards Antoine.
[Tim]
Note that no "extra" storage is needed to exploit this. No character lookups, no extra expenses in time or space of any kind. Just "if we mismatch on the k'th try, we can jump ahead k positions".
[Antoine Pitrou <solipsis@pitrou.net>]
Ok, so that means that on a N-character haystack, it'll always do at least N character comparisons?
IIRC, the current algorithm is able to do much less than that in favorable cases. If the needle is k characters long and they are all distinct, it's able to do N/k comparisons.
It's an excellent question, and from what I said it's a correct inference. But I only described how the algorithm matches the "v" part of the "needle = u + v" splitting. When matching the "u" part, skips materially exceeding the count of comparisons made are possible.

The paper claims the minimal number of comparisons needed (ignoring preprocessing) is 2N/k, so same order as the current algorithm. But, for the constant factor, Dennis's code achieves N/k, because he augmented the Crochemore-Perrin algorithm with a variant of Sunday's algorithm (a simplification, but not as extreme as the current code's simplification). Note that best case near N/k is a known lower bound for best cases - impossible to do materially better.

In the 'x' * 100 + 'y' example I ended my msg with, recall that v was just "y", so the part I described was of minimum value - if the last haystack character in the current search window isn't "y", the mismatch occurs on the first try, which leads to a shift of just 1. But if the character one beyond the end of the current search window isn't "x" or "y", then Sunday's algorithm allows skipping 102 positions (len(needle) + 1).

His algorithm requires precomputing a skip-count table of O(len(sigma)) space, where sigma is the universe of possible characters ("the alphabet"). That's "a whole lot" for Unicode. Fredrik and Dennis simplified Sunday's algorithm to weaker versions that require relatively small constant space instead, independent of needle and alphabet size.

Despite the current code's simplification of that nearly to the point of non-existence (it reduces the space needed to 32 bits, and with only one possible skip count), it's nevertheless so effective that it beat Dennis's initially pure Crochemore-Perrin code by dramatic margins in a significant number of exceptionally lucky (for the current code) cases. After adding a variation of Sunday (the current PR uses 256 bytes, and has 64K possible skip counts - but perhaps other tradeoffs would work better), we haven't yet seen a case (contrived or natural) where the current code is dramatically faster. But we have seen non-contrived cases where the current PR is dramatically faster, including in the benchmark code Fredrik gifted us with (Tools/stringbench/stringbench.py in your Python tree).

Alas, the higher preprocessing costs leave the current PR slower in "too many" cases too, especially when the needle is short and found early in the haystack. Then any preprocessing cost approaches a pure waste of time.
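[Editorial note: as a concrete, hedged illustration of bounding a Sunday-style table to a fixed size regardless of alphabet, here is one way a 256-entry table could be built, by bucketing characters on their low byte and keeping the most conservative skip per bucket, so bucketing can only ever under-skip, never skip past a valid match. This is an assumption made for illustration; the PR's actual table layout and encoding may well differ.]

```python
# Hedged sketch: a Sunday-style skip table squeezed into 256 entries by
# bucketing characters on their low byte.  Illustration only.

def make_skip_table(needle):
    m = len(needle)
    table = [m + 1] * 256                 # "not in needle": skip past it entirely
    for j, ch in enumerate(needle):
        bucket = ord(ch) & 0xFF
        # Keep the smallest (most conservative) skip seen for this bucket,
        # so collisions can never cause a valid match to be missed.
        table[bucket] = min(table[bucket], m - j)
    return table

def sunday_find(haystack, needle):
    n, m = len(haystack), len(needle)
    table = make_skip_table(needle)
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        if i + m >= n:
            break
        # Skip based on the character just past the window, Sunday-style.
        i += table[ord(haystack[i + m]) & 0xFF]
    return -1

text = "никто не ожидает испанскую инквизицию"
print(sunday_find(text, "испанскую"), text.find("испанскую"))  # both print the same index
```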
[Tim]
... Alas, the higher preprocessing costs leave the current PR slower in "too many" cases too, especially when the needle is short and found early in the haystack. Then any preprocessing cost approaches a pure waste of time.
But that was this morning. Since then, Dennis changed the PR to back off to the current code when the needle is "too small". There are very few cases we know of now where the PR code is slower at all than the current code, none where it's dramatically slower, many where it's significantly faster, and some non-contrived cases where it's dramatically faster (for example, over a factor of 10 in stringbench.py's "late match, 100 characters" forward-search tests, and especially beneficial for Unicode (as opposed to bytes)). Then there are the pathological cases like in the original issue report, where it's multiple orders of magnitude faster (about 3 1/2 hours to under a tenth of a second in the issue report's case). Still waiting for someone who thinks string search speed is critical in their real app to give it a try. In the absence of that, I endorse merging this.
On 17/10/20 3:26 pm, Tim Peters wrote:
Tal Einat posted earlier that he was keen to try to work up a clear explanation, and I look forward to that. All the expositions I've found of this algorithm so far are hard to approach.
Maybe Mathologer or 3blue1brown could be persuaded to help? They seem to have a knack for making tricky stuff understandable. -- Greg
participants (13): Antoine Pitrou, Brett Cannon, Chris Angelico, David Mertz, Dennis Sweeney, Greg Ewing, Guido van Rossum, Larry Hastings, Marco Sulla, Raymond Hettinger, Steven D'Aprano, Tal Einat, Tim Peters