[Python-Dev] Re: Changing Python's string search algorithms

17 Oct 2020

      [Tim]
...
...
Note that no "extra" storage is needed to exploit this. No character
lookups, no extra expenses in time or space of any kind.  Just "if we
mismatch on the k'th try, we can jump ahead k positions".
[Antoine Pitrou ]
...
Ok, so that means that on a N-character haystack, it'll always do at
least N character comparisons?
IIRC, the current algorithm is able to do much less than that in
favorable cases.  If the needle is k characters long and they are all
distinct, it's able to do N/k comparisons.
It's an excellent question, and from what I said it's a correct inference.

But I only described how the algorithm matches the "v" part of the
"needle = u + v" splitting.  When matching the "u" part, skips
materially exceeding the count of comparisons made are possible.

The paper claims the minimal number of comparisons needed (ignoring
preprocessing) is 2N/k, so same order as the current algorithm. But,
for the constant factor, Dennis's code achieves N/k, because he
augmented the Crochemre-Perrin algorithm with a variant of Sunday's
algorithm (a simplification, but not as extreme as the current code's
simplification). Note that best case near N/k is a known lower bound
for best cases - impossible to do materially better.

In the 'x' * 100 + 'y' example I ended my msg with, recall that  v wa
just "y", so the part I described was of minimum value - if the last
haystack character in the current search window isn't "y", that the
mismatch occurred on the first try leads to a shift of just 1.

But if the character one beyond the end of the current search window
isn't "x" or "y" then, Sunday's algorithm allows to skip 102 positions
(len(needle) + 1).  His algorithm requires precomputing a skip-count
table of O(len(sigma)) space, where sigma is the universe of possible
characters ("the alphabet").  That's "a whole lot" for Unicode.

Fredrik and Dennis simplified Sunday's algorithm to weaker versions
that require relatively small constant space instead, independent of
needle, pattern, and alphabet size.

Despite the current code's simplification of that nearly to the point
of non-existence (it reduces the space needed to 32 bits, and with
only one possible skip count), it's nevertheless so effective that it
beat Dennis's initially pure Crochemre-Perrin code by dramatic margins
in a significant number of exceptionally lucky (for the current code)
cases.  After adding a variation of Sunday (the current PR uses 256
bytes, and has 64K possible skip counts - but perhaps other tradeoffs
would work better), we haven't yet seen a case (contrived or natural)
where the current code is dramatically faster.

But we have seen non-contrived cases where the current PR is
dramatically faster, including in the benchmark code Fredrik gifted us
with (Tools/stringbench/stringbench.py in your Python tree).  Alas,
the higher preprocessing costs leave the current PR slower in "too
many" cases too, especially when the needle is short and found early
in the haystack. Then any preprocessing cost approaches a pure waste
of time.

[Python-Dev] Re: Changing Python's string search algorithms

Tim Peters