[Guido]
The key seems to be:
Except none of that quoted text (which I'll skip repeating) gives the slightest clue as to _why_ it may be an improvement. So you split the needle into two pieces. So what? What's the _point_? Why would someone even imagine that might help? Why is one half then searched right to left, but the other half left to right? There's actually "a reason" for searching the right half left to right, but because the shift on a mismatch in the left half is a constant ("per(x)") independent of the mismatching position, it's actually irrelevant in which order the left half's characters are compared (well, at least in some variations of these newer algorithms). Over-specification can be actively misleading too. What's the key idea? "Split the needle into two pieces" is a key _step_, not a key _insight_.
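For whatever it's worth, here's my own brute-force sketch of the two quantities the splitting step is really about: per(x), the needle's smallest period, and the "local period" at each possible split point. The real preprocessing finds a good split in linear time; this is purely for intuition, and the function names are mine:

    def per(x):
        """Smallest period p of x: x[i] == x[i+p] whenever both indices are valid."""
        m = len(x)
        for p in range(1, m + 1):
            if all(x[i] == x[i + p] for i in range(m - p)):
                return p
        return m

    def local_period(x, split):
        """Smallest r such that some length-r string straddles the split,
        agreeing with x on both sides wherever they overlap."""
        m = len(x)
        for r in range(1, m + 1):
            if all(x[i] == x[i + r]
                   for i in range(max(0, split - r), min(split, m - r))):
                return r
        return m

    x = "abaab"
    print("per(x) =", per(x))  # 3
    for split in range(1, len(x)):
        print(f"{x[:split]!r} | {x[split:]!r}  local period {local_period(x, split)}")
    # 'ab' | 'aab' has local period 3 == per(x): a "critical factorization".
    # The critical factorization theorem guarantees such a split exists with the
    # left piece shorter than per(x).  That's the split these algorithms search
    # around: the right piece is matched left to right, and a mismatch in the
    # left piece allows shifting the window by the constant per(x).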
I need a Python version though.
Dennis wrote one for his first stab, which you can download from the bug report (it's attached there as fastsearch.py). On the bug report's data, running that under CPython 3.9.0 on my box reduced the time from about 3 1/2 hours to about 20 seconds; running it under PyPy, to under one second (and the current C implementation, under 0.1 second).
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses.
They probably do, but not dramatically. Would you, e.g., notice a factor of 2 either way in those cases? A factor of 4? Of 10? When Fredrik first wrote this code, he routinely saw factors of 2 to 10 speedup on all sorts of string-searching benchmarks, both contrived and "natural". The speedups were especially pronounced for Unicode strings at the time, where skipping futile Unicode character comparisons could be more valuable than skipping (shorter) single-byte comparisons. I should also note that the fixed-string searching code is used as a subroutine by parts of our regexp implementation, by str.count(), str.replace(), similar `bytes` operations, and so on.
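If you're curious whether it matters for your own uses, it only takes a minute to time them; something along these lines, with your own typical needles and haystacks substituted in:

    # Quick sanity check for "typical" short-needle searches; numbers vary a lot
    # across builds and inputs, so substitute your own.
    import timeit

    text = "the quick brown fox jumps over the lazy dog " * 50
    for stmt in ('text.find("lazy dog")',
                 'text.count("o")',
                 'text.replace("fox", "cat")'):
        t = timeit.timeit(stmt, globals=globals(), number=100_000)
        print(f"{stmt:35} {t:.3f}s for 100_000 calls")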
What are typical hard cases used for?
It's kinda like asking what typical hard rounding cases for pow() are used for ;-) They aren't deliberate. They "just happen", typically when a pattern and the text to search contain self-similarities "but with glitches".

Search the string

    text = "X" * 10_000_000

for

    needle = "X" * 1_000_000

Easy! It matches at once.

But now tack "Y" on to the end of the needle. Then it's impossible to match. Brute force search first finds a prefix match at text[0 : 1000000], but then fails to match the trailing "Y". So brute force uses another million compares to match the prefix at text[1 : 1000001]. But again fails to match the trailing "Y". Then another million-plus-one compares to fail to match starting at index 2, and again at index 3, and again at index 4, ... it piles up trillions of character comparisons before it finally fails.

The current code starts comparing at the right end of each trial position first. Then an "X" from the text and the needle's trailing "Y" mismatch at once. That's a million times faster.

Here's a simple case where the current code is horribly slow, because neither "right-to-left" nor any of its preprocessing tricks do any good at all. Even Daniel Sunday's full-blown algorithm[1] (which the current code strips down _almost_ to the point of non-existence) wouldn't help (unless it used a variant I've never actually seen that compared the rarest pattern character first):

    ("X" * 10_000_000).find("X" * 500_000 + "Y" + "X" * 500_000)

The newer code returns -1 seemingly instantly.
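If it helps to see that blowup concretely, here's a scaled-down toy counter (my own throwaway script, not CPython's code) that tallies the character comparisons plain left-to-right brute force burns on shrunken versions of that input:

    def brute_force_compares(text, needle):
        """Count character comparisons done by naive left-to-right search."""
        n, m = len(text), len(needle)
        compares = 0
        for start in range(n - m + 1):
            for j in range(m):
                compares += 1
                if text[start + j] != needle[j]:
                    break
            else:
                return start, compares  # found a match
        return -1, compares

    for n, m in [(1_000, 100), (2_000, 200), (4_000, 400)]:
        _, c = brute_force_compares("X" * n, "X" * m + "Y")
        print(f"n={n:>5} m={m:>4} compares={c:>9}")
    # The count grows like n*m (it roughly quadruples each time both double);
    # extrapolating to n=10_000_000, m=1_000_000 gives on the order of 10**13,
    # while str.find() on the full-sized inputs returns -1 almost instantly.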
DNA search? (That would be cool!)
It can certainly help. Self-similarities are bound to be common in long strings from a 4-character alphabet (ACGT). But, to be fair, serious work with genomes typically invests in building a giant suffix tree (or enhanced suffix array) for each cataloged sequence to be searched. No preprocessing of needles is needed then to guarantee worst-case efficient searches of many kinds (including things like finding the longest adjacent repeated subsequences, more like a regexp (.+)\1 kind of search). But the new code could quite plausibly speed more casual prototyping of such explorations.
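To make the suffix-array idea concrete, here's a toy sketch of my own: naive construction, nothing like the engineered indexes real genome tools use, and the bisect key= trick needs Python 3.10+.

    import bisect

    def build_suffix_array(s):
        # O(n**2 log n) naive construction -- fine for a demo, not for genomes.
        return sorted(range(len(s)), key=lambda i: s[i:])

    def find_with_suffix_array(s, sa, needle):
        """Return one index where needle occurs in s, else -1 -- no needle preprocessing."""
        k = len(needle)
        lo = bisect.bisect_left(sa, needle, key=lambda i: s[i:i + k])
        if lo < len(sa) and s[sa[lo]:sa[lo] + k] == needle:
            return sa[lo]
        return -1

    genome = "ACGTACGTGACGATTACA"
    sa = build_suffix_array(genome)
    for needle in ("GATTACA", "TTT"):
        print(needle, find_with_suffix_array(genome, sa, needle), genome.find(needle))
    # Both columns agree: 11 for the hit, -1 for the miss.

[1] https://dl.acm.org/doi/abs/10.1145/79173.79184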