Fredrik Lundh crafted our current string search algorithms, and they've served us very well. They're nearly always as fast as dumbest-possible brute force search, and sometimes much faster. This was bought with some very cheap one-pass preprocessing of the pattern (the substring to search _for_) to compute a single integer "skip" and a single 32-bit bitmask, which can sometimes be used to leap over regions where it can prove matches are doomed to fail. Unlike most "fancier than brute" algorithms, it does not build preprocessing tables of size proportional to the length of the pattern or to the size of the alphabet. The extra space is O(1), and tiny, independent of pattern and alphabet size. (A rough pure-Python sketch of the idea is appended at the end of this message.)

A weakness is that worst-case search time is proportional to the product of the lengths of the inputs. That can happen for searches that succeed as well as for those that fail. This was recently dramatically illustrated by this bug report:

https://bugs.python.org/issue41972

where a single large string search consumed well over 3 hours of CPU time before it succeeded. It "should take" a fraction of a second. (The second sketch at the end shows one input shape that can trigger this kind of blowup.)

While it was news to me, Dennis Sweeney was aware that other algorithms, also requiring only O(1) extra space, are now well known with worst-case linear time. That's really quite remarkable :-) They do require fancier (but still linear-time) preprocessing, so can't always be a pure win.

So this message: string (which includes bytes) search is an extremely common operation, across all kinds of applications, so changes here need as many eyeballs on them as they can get. What's typical? What's important? What doesn't matter? What about reverse searches ("rfind")? How can we quantify the acceptable region of tradeoffs?

My guess is that searches for substrings less than 50 characters long in strings under ten thousand characters are most important for most apps, but I just pulled that guess out of my butt. I'm also guessing that searches for multi-million-character substrings (like in the bug report above) are rare indeed, and that any change sparing them from quadratic-time disaster would be "way more than good enough". But I don't know.

If you can help, Dennis already has a preliminary PR implementing one of the newer algorithms, augmented with Fredrik's unique ideas:

https://github.com/python/cpython/pull/22679

By "help" I don't necessarily mean code review - it could, e.g., be a huge help to test-drive it on your own critical string-slinging apps (the last sketch at the end is a tiny timing harness you could adapt). Faster? Slower? A wash? Wrong answer? Crash?

Due to the natures of the old and new algorithms, neither will be faster in every case: the newer one should never be dramatically slower than the old one, it will be spectacularly faster in some cases, and "in general" it seems impossible to guess in advance. It depends on how much value the fancier preprocessing can extract from the pattern versus its higher preprocessing cost, and that in turn depends on the details of exactly which patterns and texts are being searched.
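
To make the first paragraph concrete, here's a rough pure-Python sketch of the flavor of the skip/bitmask trick. It is not the actual C implementation (which differs in many details); the function names and the exact skip and bitmask rules here are just illustrative. The bitmask acts like a tiny Bloom filter: a clear bit proves a character appears nowhere in the pattern, while a set bit proves nothing.

    def _preprocess(pattern):
        # One cheap pass over the pattern: O(1) extra space regardless of
        # pattern length or alphabet size.
        m = len(pattern)
        mask = 0          # 32-bit "might this character be in the pattern?" filter
        skip = m - 1      # shift to use when the last character matches but the
                          # full comparison fails
        for i, ch in enumerate(pattern):
            mask |= 1 << (ord(ch) & 31)      # hash every character into 32 bits
            if i < m - 1 and ch == pattern[-1]:
                skip = m - 1 - i             # distance back to the last earlier
                                             # occurrence of the final character
        return mask, skip

    def crude_find(text, pattern):
        # Illustrative forward search; returns the leftmost index or -1.
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        mask, skip = _preprocess(pattern)
        last = pattern[-1]
        i = 0
        while i + m <= n:
            if text[i + m - 1] == last and text[i:i + m] == pattern:
                return i
            # Peek at the character just past the window: if its bit isn't set,
            # it appears nowhere in the pattern, so no window containing it can
            # match and the search can leap past it.
            if i + m < n and not (mask >> (ord(text[i + m]) & 31)) & 1:
                i += m + 1
            elif text[i + m - 1] == last:
                i += skip       # realign on the last character's earlier occurrence
            else:
                i += 1
        return -1

The point isn't the exact rules: it's that both pieces cost one cheap pass and O(1) space, and when the pattern's characters are rare in the text the search flies, while in unlucky cases it degrades toward (or below) brute force.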
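
To put a rough number on the quadratic worst case, here's a toy timing script. The input shape is my own illustrative guess at a bad case (it is not the reproducer from the bug report): the pattern's last character matches at every alignment, each attempt then matches about half the pattern before failing, and neither the skip nor the bitmask can rule out the next alignment. Whether you actually see the blowup depends on which algorithm your interpreter ships; under a build of Dennis's PR it should stay close to linear.

    import time

    def timed_find(text, pattern):
        t0 = time.perf_counter()
        text.find(pattern)               # always fails for these inputs
        return time.perf_counter() - t0

    n = 500_000
    text = "a" * n
    for m in (1_000, 5_000, 10_000):
        k = m // 2
        # ~k characters match at each of ~n alignments, and the shift is 1,
        # so total work grows roughly like n * m under the current algorithm.
        pattern = "a" * k + "b" + "a" * k
        print(f"len(text)={n:,} len(pattern)={len(pattern):,} "
              f"-> {timed_find(text, pattern):.2f}s")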
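
And here's the tiny timing harness mentioned above - a hedged sketch, not a real benchmark suite. The cases are placeholders; substitute (haystack, needle) pairs that actually occur in your own apps, run it under both a stock interpreter and a build of Dennis's PR, and compare.

    import timeit

    def per_call_seconds(haystack, needle, number=200):
        # Average seconds per str.find call over `number` repetitions.
        return timeit.timeit(lambda: haystack.find(needle), number=number) / number

    # Placeholder cases -- replace with data representative of your workload.
    cases = {
        "short needle, short haystack": ("abcdefgh" * 100, "defg"),
        "short needle, long haystack":  ("abcdefgh" * 100_000, "xyz"),
        "long, repetitive needle":      ("ab" * 500_000, "ab" * 1_000 + "c"),
    }
    for name, (hay, nee) in cases.items():
        print(f"{name:30s} {per_call_seconds(hay, nee) * 1e6:9.1f} usec per find")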