
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function containing some incomprehensible remarks and then a TODO...

On Wed, Oct 14, 2020 at 9:07 AM Tim Peters <tim.peters@gmail.com> wrote:
Rest assured that Dennis is aware that pragmatics may change for shorter needles.
The code has always made a special-case of 1-character needles, because it's impossible "even in theory" to improve over straightforward brute force search then.
Say the length of the text to search is `t`, and the length of the pattern `p`. Brute force and the current code have worst case O(t*p) behavior. The new code, worst case O(t+p). If that's as deep as you dig into it, it seems all but obvious then that O(t*p) just can't be that bad when p is small, so keep it simple.
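To make "brute force" concrete, here's the obvious nested scan (a toy sketch, not the actual stringlib code; the name is mine), along with the flavor of input that drives it toward ~t*p character comparisons:

def brute_force_find(text, pattern):
    # Try every alignment; compare left to right until a mismatch.
    t, p = len(text), len(pattern)
    for start in range(t - p + 1):
        i = 0
        while i < p and text[start + i] == pattern[i]:
            i += 1
        if i == p:
            return start
    return -1

# Worst case: nearly every alignment matches p-1 characters before
# failing on the last one, e.g.
#     brute_force_find("a" * 10_000, "a" * 99 + "b")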
But there's another twist: the current and new code both have O(t/p) cases too (but brute force, and even Knuth-Morris-Pratt, don't). That can be highly beneficial even for p as small as 3.
Unfortunately, the exact cases in which the current and new code enjoy O(t/p) behavior aren't the same.
Lying a bit: In general the current code has just two tricks. One of those tricks is useless (pure waste of time) if the pattern happens to end with a repeated character pair, and is most effective if the last character appears nowhere else in the pattern. The other trick is most effective if the set of characters in the text has no intersection with the set of characters in the pattern (note that the search is certain to fail then), but useless if the set of text characters is a subset of the set of pattern characters (as, e.g., it very often is in real-life searches in apps with a small alphabet, like [ACGT]+ for genomes).
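If it helps to see them spelled out, here's a much-simplified sketch of those two tricks (my own toy rendition, not the actual stringlib fastsearch, which uses a bloom-style bitmask rather than a set). Trick 1 is the shift keyed off the pattern's last character; trick 2 is the "is the next text character in the pattern at all?" test, which is also where the O(t/p) behavior comes from, since one lucky probe rules out a whole pattern's worth of text:

def fastsearch_sketch(text, pattern):
    t, p = len(text), len(pattern)
    if p == 0:
        return 0
    last = pattern[-1]
    chars = set(pattern)          # stand-in for the bloom-style bitmask
    # Trick 1: how far we may shift when the window's last character
    # matches pattern[-1] but the window as a whole doesn't match.
    try:
        skip = p - 1 - pattern.rindex(last, 0, p - 1)
    except ValueError:            # last character appears nowhere else
        skip = p
    i = 0
    while i + p <= t:
        if text[i + p - 1] == last and text[i:i + p] == pattern:
            return i
        nxt = i + p
        if nxt < t and text[nxt] not in chars:
            # Trick 2: the character just past the window isn't in the
            # pattern, so no match can overlap it; leap past it.
            i += p + 1
        elif text[i + p - 1] == last:
            i += skip
        else:
            i += 1
    return -1

With a pattern ending in a repeated pair (say "...aa") the skip degenerates to 1, and when every text character is also a pattern character (the [ACGT]+ situation) trick 2 never fires - which is exactly the "useless" cases above.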
But I don't know how to characterize when the new code gets maximum benefit. It's not based on intuitive tricks. The paper that introduced it[1] says it's based on "a deep theorem on words known as the Critical Factorization Theorem due to Cesari, Duval, Vincent, and Lothaire", and I still don't fully understand why it works.
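FWIW, the statement of the theorem (if not its proof) is easy to poke at by brute force: every nonempty string has a "critical" cut, at an offset smaller than its global period, where the local period - the shortest repetition straddling the cut, with overhangs off either end allowed - equals the global period. A toy checker (the names are mine, not the paper's):

def global_period(x):
    # Smallest p >= 1 with x[i] == x[i+p] for all i in range(len(x) - p).
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))

def local_period(x, cut):
    # Shortest r >= 1 such that a length-r word repeated twice and centered
    # on the cut agrees with x wherever they overlap (overhangs are free).
    n = len(x)
    for r in range(1, n + 1):
        if all(x[cut - r + k] == x[cut + k]
               for k in range(r)
               if cut - r + k >= 0 and cut + k < n):
            return r
    return n

# Critical Factorization Theorem: some cut in range(global_period(x))
# is "critical", i.e. its local period equals the global period.
x = "abaab"
gp = global_period(x)                                   # 3 here
crit = [c for c in range(gp) if local_period(x, c) == gp]
assert crit                                             # [2]: "ab" | "aab"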
It's clear enough, though, that the new code gets real systematic value out of patterns with repetitions (like "abab"), where the current code gets real value from those only by accident (e.g., "a" and "b" happen to appear rarely in the text being searched, in which case the fact that the pattern has repetitions is irrelevant).
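For anyone who'd rather see the overall shape than wade through C, here's a rough Python transliteration of the textbook two-way search (the same scheme glibc's memmem uses - it is NOT the code in the PR, which layers more tweaks on top, and the names are mine). The "memory" bookkeeping in the periodic branch is exactly where a needle like "abab" wins systematically: after a shift by the period, the part of the window already known to match is never rescanned.

def _maximal_suffix(needle, rev):
    # Index where the lexicographically maximal suffix starts, and the
    # period of that suffix; rev=True uses the reversed character order.
    n = len(needle)
    ms, j, k, p = -1, 0, 1, 1
    while j + k < n:
        a, b = needle[j + k], needle[ms + k]
        if (b < a) if rev else (a < b):
            # The suffix starting here is smaller; grow the period.
            j += k
            k = 1
            p = j - ms
        elif a == b:
            if k != p:
                k += 1
            else:
                j += p
                k = 1
        else:
            # Found a larger suffix; restart from this position.
            ms = j
            j = ms + 1
            k = p = 1
    return ms + 1, p

def two_way_find(haystack, needle):
    # First index of needle in haystack, or -1.
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Critical factorization: needle = needle[:cut] + needle[cut:], with
    # cut smaller than needle's period and needle[cut:] having period `period`.
    l1, p1 = _maximal_suffix(needle, rev=False)
    l2, p2 = _maximal_suffix(needle, rev=True)
    cut, period = (l1, p1) if l1 >= l2 else (l2, p2)

    if needle[:cut] == needle[period:period + cut]:
        # Periodic needle (e.g. "abab"): shift by the period and remember
        # how much of the next window is already known to match.
        memory = 0
        j = 0
        while j + m <= n:
            i = max(cut, memory)
            while i < m and needle[i] == haystack[j + i]:
                i += 1
            if i >= m:
                i = cut - 1
                while i >= memory and needle[i] == haystack[j + i]:
                    i -= 1
                if i < memory:
                    return j
                j += period
                memory = m - period
            else:
                j += i - cut + 1
                memory = 0
    else:
        # Non-periodic needle: any mismatch permits a big jump.
        period = max(cut, m - cut) + 1
        j = 0
        while j + m <= n:
            i = cut
            while i < m and needle[i] == haystack[j + i]:
                i += 1
            if i >= m:
                i = cut - 1
                while i >= 0 and needle[i] == haystack[j + i]:
                    i -= 1
                if i < 0:
                    return j
                j += period
            else:
                j += i - cut + 1
    return -1

# two_way_find("xbababab", "abab") -> 2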
But, as I said in the first message, the PR is "preliminary". There are still worlds of tweaks that have been thought of but not yet tried even once.
[1] Search for "Two-Way String Matching" by Crochemore and Perrin.
--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>