On Wed, Oct 14, 2020 at 9:56 AM Tim Peters <tim.peters@gmail.com> wrote:
[Guido]
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO..
Yes, the Wikipedia page is worse than useless in its current state, although some of the references it lists are helpful. This is a much better summary:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
but, I believe, still far too telegraphic.
The key seems to be: """ The searching phase of the Two Way algorithm consists in first comparing the character of xr from left to right, then the character of x from right to left. When a mismatch occurs when scanning the k-th character of xr, then a shift of length k is performed. When a mismatch occurs when scanning x, or when an occurrence of the pattern is found, then a shift of length per(x) is performed. Such a scheme leads to a quadratic worst case algorithm, this can be avoided by a prefix memorization: when a shift of length per(x) is performed the length of the matching prefix of the pattern at the beginning of the window (namely m-per(x)) after the shift is memorized to avoid to scan it again during the next attempt. """ The preprocessing comes down to splitting the original search string ("needle") in two parts, xl and xr, using some clever algorithm (IIUC the wikipedia page does describe this, although my brain is currently too addled to visualize it). The original paper is by far the best account I've seen so far, with
complete code and exhaustive explanations and proofs. Even examples ;-)
I need a Python version though.
But here's the thing: I don't believe this algorithm this _can_ be reduced to an elevator pitch. Excruciating details appear to be fundamental at every step, and I've seen nothing yet anywhere approaching an "intuitive explanation" :-( What happens instead is that you run it on the hardest cases you can dream up, and your jaw drops in amazement :-)
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses. What are typical hard cases used for? DNA search? (That would be cool!) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>