[Steven D'Aprano <steve@pearwood.info>]
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method?
Leave it up to the user to decide whether or not their data best suits the find method or the new search function. It sounds like we can offer some rough heuristics, but the only way to really know is "try it and see which works best for you".
The `string` module is an obvious place for it.
I think this is premature. There is almost never an optimization that's a pure win in all cases. For example, on some platforms `timsort` will never be as fast as the old samplesort in cases with a very large number of equal elements, and on all platforms `timsort` consumes more memory. And it's a wash on "random" data on most platforms. Nevertheless, it was a significant speed win for many - possibly even most - real-life cases.

So far, the PR reduces the runtime in the bug report from about 3 1/2 hours to under a tenth of a second. It would be crazy not to gift users such dramatic relief by default, unless there are good reasons not to. Too soon to say.

On tests of random strings with characters following a Zipf distribution (more realistic than uniform but still very easy to generate - and not contrived in any way to favor either approach), the current new code is usually faster than the status quo, in no case has been worse than twice as slow, and in a number of cases has been 20x faster. It's already clearly a win _overall_.

The bulk of the "but it's slower" cases now are found with very short needles (patterns), which was expected (read my comments on the bug report), and exacerbated by the fact that the structure of the random test generation is quite likely to create cases where a short needle is found early in the haystack. This combination makes the higher preprocessing overhead as damaging as possible.

Dennis also expanded the current code's "32-bit bitmask" trick (which is an extreme simplification of Daniel Sunday's algorithm) to a fixed-size 256-byte table of character-class-specific skip counts, which went a _very_ long way toward eliminating the cases where the current code enjoyed a "pure luck" advantage over the new code. But that increased the preprocessing overhead again, which again is relatively most significant for short needles - those 256 bytes have to be initialized every time, no matter the lengths of the needle or haystack.

If the new code were changed to exempt short needles, even now it might be hard to make an objective case not to adopt it.

But it would leave open a different idea for an "opt-in" facility: one that allowed users to give a needle to a function and get back an object capturing the results of preprocessing, plus a different wrapper around the search code that accepted the already-preprocessed info. There are, e.g., certainly cases where repeated searches for a given 4-character needle could be sped up significantly by exploiting the new code, but where the overhead of preprocessing is too much to bear in "random" performance testing. It would also open the door to more aggressive (expensive) kinds of preprocessing. That's where users may be able to make useful, informed judgments.
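For concreteness, here is a rough, hypothetical sketch of what such an opt-in API could look like. The names `PreprocessedNeedle` and `find_preprocessed` are made up, and the preprocessing shown is only a simplified Sunday-style 256-entry skip table indexed by byte value - an illustration of the "pay the table cost once" idea, not the two-way algorithm in the PR and not the character-class table Dennis actually used:

```python
class PreprocessedNeedle:
    """Hypothetical: capture the preprocessing for one needle, reuse it many times."""

    def __init__(self, needle: bytes):
        self.needle = needle
        m = len(needle)
        # skip[c] = how far the window may jump when byte c is the byte just
        # past the current window; m + 1 means c doesn't occur in the needle.
        self.skip = [m + 1] * 256
        for i, b in enumerate(needle):
            self.skip[b] = m - i


def find_preprocessed(haystack: bytes, pre: PreprocessedNeedle, start: int = 0) -> int:
    """Return the lowest index >= start where pre.needle occurs, else -1."""
    needle, skip = pre.needle, pre.skip
    n, m = len(haystack), len(needle)
    if m == 0:
        return start if start <= n else -1
    i = start
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:              # no byte past the window: nothing left to try
            return -1
        i += skip[haystack[i + m]]  # Sunday-style bad-character shift
    return -1


# Pay the preprocessing cost once, then run many searches against it:
pre = PreprocessedNeedle(b"need")
hits = [find_preprocessed(line, pre) for line in (b"a needle", b"no match here")]
# hits == [2, -1]
```

The particular algorithm isn't the point; the point is that the table (or anything more expensive) is built once per needle instead of on every call, which is exactly where repeated searches for the same short needle stand to gain.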
[David Mertz]
That said, I do recognize that the `re` module also has pathological cases, and the standard library documentation explicitly says "maybe you want to try the third-party `regex` implementation." That is sort of precedent for this approach.
`regex` also suffers exponential time disasters; they're just harder to stumble into - and there are books explaining in detail how and why regex engines can fall into these traps. It's far less _expected_ in plain string searches. Dennis was aware of the new algorithms because, apparently, (at least) glibc's memmem switched to one some years ago. So we're playing catch-up here.
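For anyone who hasn't run into one of those traps, here is a minimal illustrative example (not taken from the bug report) of the classic shape those books warn about, shown with the stdlib `re`; per the above, `regex` can be pushed into similar behavior, just less easily:

```python
import re

# "(a+)+" gives the backtracking engine exponentially many ways to partition
# a run of 'a's, and every partition must be tried once the trailing 'b'
# forces a mismatch.
pattern = re.compile(r"(a+)+b")

# No match exists, and the running time roughly doubles with each extra 'a'.
pattern.search("a" * 28 + "c")
```

Plain substring search has no comparable folklore, which is why a quadratic-time `str.find` surprises people in a way a slow regex does not.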