[Python-Dev] Re: Changing Python's string search algorithms

14 Oct 2020


      Maybe someone reading this can finish the Wikipedia page on Two-Way Search?
The code example trails off with a function with some incomprehensible
remarks and then a TODO...

On Wed, Oct 14, 2020 at 9:07 AM Tim Peters <tim.peters@gmail.com> wrote:
...
Rest assured that Dennis is aware of that pragmatics may change for
shorter needles.
The code has always made a special-case of 1-character needles,
because it's impossible "even in theory" to improve over
straightforward brute force search then.
Say the length of the text to search is `t`, and the length of the
pattern `p`. Brute force and the current code have worst case O(t*p)
behavior. The new code, worst case O(t+p). If that's as deep as you
dig into it, it seems all but obvious then that O(t*p) just can't be
that bad when p is small, so keep it simple.
But there's another twist:  the current and new code both have O(t/p)
cases too (but brute force, and even Knuth-Morris-Pratt, don't). That
can be highly beneficial even for p as small as 3.
Unfortunately, the exact cases in which the current and new code enjoy
O(t/p) behavior aren't the same.
Lying a bit:  In general the current code has just two tricks. One of
those tricks is useless (pure waste of time) if the pattern happens to
end with a repeated character pair, and is most effective if the last
character appears nowhere else in the pattern. The other trick is most
effective if the set of characters in the text has no intersection
with the set of characters in the pattern (note that the search is
certain to fail then), but useless if the set of text characters is a
subset of the set of pattern characters (as, e.g., it very often is in
real-life searches in apps with a small alphabet, like [ACGT}+ for
genomes).
But I don't know how to characterize when the new code gets maximum
benefit. It's not based on intuitive tricks.  The paper that
introduced it[1] says it's based on "a deep theorem on words
known as the Critical Factorization Theorem due to Cesari, Duval,
Vincent, and Lothaire", and I still don't fully understand why it
works.
It's clear enough, though, that the new code gets real systematic
value out of patterns with repetitions (like "abab"), where the
current code gets real value from those only by accident (e.g., "a"
and "b" happen to appear rarely in the text being searched, in which
case that the pattern has repetitions is irrelevant).
But, as I said in the first message, the PR is "preliminary". There
are still worlds of tweaks that have been thought of but not yet tried
even once.
[1] search for "Two-Way String Matching" by Crochemore and Perrin.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/G53VXXYW...
Code of Conduct: http://python.org/psf/codeofconduct/
-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>