[Steven D'Aprano <steve@pearwood.info>]
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method?
Leave it up to the user to decide whether or not their data best suits the find method or the new search function. It sounds like we can offer some rough heuristics, but the only way to really know is "try it and see which works best for you".
The `string` module is an obvious place for it.
I think this is premature. There is almost never an optimization that's a pure win in all cases. For example, on some platforms `timsort` will never be as fast as the old samplesort in cases with a very large number of equal elements, and on all platforms `timsort` consumes more memory. And it's a wash on "random" data on most platforms. Nevertheless, it was a significant speed win for many - possibly even most - real-life cases.

So far, the PR reduces the runtime in the bug report from about 3 1/2 hours to under a tenth of a second. It would be crazy not to gift users such dramatic relief by default, unless there are good reasons not to. Too soon to say.

On tests of random strings with characters following a Zipf distribution (more realistic than uniform but still very easy to generate - and not contrived in any way to favor either approach), the current new code is usually faster than the status quo, in no case has been worse than twice as slow, and in a number of cases has been 20x faster. It's already clearly a win _overall_.

The bulk of the "but it's slower" cases now are found with very short needles (patterns), which was expected (read my comments on the bug report), and exacerbated by the fact that the structure of the random test generation is quite likely to create cases where a short needle is found early in the haystack. This combination makes the higher preprocessing overhead as damaging as possible.

Dennis also expanded the current code's "32-bit bitmask" trick (which is an extreme simplification of Daniel Sunday's algorithm) to a fixed-size 256-byte table of character-class-specific skip counts, which went a _very_ long way toward eliminating the cases where the current code enjoyed a "pure luck" advantage over the new code. But that increased the preprocessing overhead again, which again is relatively most significant for short needles - those 256 bytes have to be initialized every time, no matter the lengths of the needle or haystack.

If the new code were changed to exempt short needles, even now it might be hard to make an objective case not to adopt it.

But it would leave open a different idea for an "opt-in" facility: one that allowed users to give a needle to a function and get back an object capturing the results of preprocessing, plus a different wrapper around the search code that accepted the already-preprocessed info. There are, e.g., certainly cases where repeated searches for a given 4-character needle could be sped up significantly by exploiting the new code, but where the overhead of preprocessing is too much to bear in "random" performance testing. It would also open the door to more aggressive (expensive) kinds of preprocessing. That's where users may be able to make useful, informed judgments.
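For concreteness, here is a rough, hypothetical sketch of what such an opt-in API could look like. The names `PreprocessedNeedle` and `find_preprocessed` are made up, and the preprocessing shown is only a simplified Sunday-style 256-entry skip table indexed by byte value - an illustration of the "pay the table cost once" idea, not the two-way algorithm in the PR and not the character-class table Dennis actually used:

```python
class PreprocessedNeedle:
    """Hypothetical: capture the preprocessing for one needle, reuse it many times."""

    def __init__(self, needle: bytes):
        self.needle = needle
        m = len(needle)
        # skip[c] = how far the window may jump when byte c is the byte just
        # past the current window; m + 1 means c doesn't occur in the needle.
        self.skip = [m + 1] * 256
        for i, b in enumerate(needle):
            self.skip[b] = m - i


def find_preprocessed(haystack: bytes, pre: PreprocessedNeedle, start: int = 0) -> int:
    """Return the lowest index >= start where pre.needle occurs, else -1."""
    needle, skip = pre.needle, pre.skip
    n, m = len(haystack), len(needle)
    if m == 0:
        return start if start <= n else -1
    i = start
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:              # no byte past the window: nothing left to try
            return -1
        i += skip[haystack[i + m]]  # Sunday-style bad-character shift
    return -1


# Pay the preprocessing cost once, then run many searches against it:
pre = PreprocessedNeedle(b"need")
hits = [find_preprocessed(line, pre) for line in (b"a needle", b"no match here")]
# hits == [2, -1]
```

The particular algorithm isn't the point; the point is that the table (or anything more expensive) is built once per needle instead of on every call, which is exactly where repeated searches for the same short needle stand to gain.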
[David Mertz]
That said, I do recognize that the `re` module also has pathological cases, and the standard library documentation explicitly says "maybe you want to try the third-party `regex` implementation." That is sort of precedent for this approach.
`regex` also suffers exponential time disasters; they're just harder to stumble into - and there are books explaining in detail how and why regex engines can fall into these traps. It's far less _expected_ in plain string searches. Dennis was aware of the new algorithms because, apparently, (at least) glibc's memmem switched to one some years ago. So we're playing catch-up here.
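For anyone who hasn't run into one of those traps, here is a minimal illustrative example (not taken from the bug report) of the classic shape those books warn about, shown with the stdlib `re`; per the above, `regex` can be pushed into similar behavior, just less easily:

```python
import re

# "(a+)+" gives the backtracking engine exponentially many ways to partition
# a run of 'a's, and every partition must be tried once the trailing 'b'
# forces a mismatch.
pattern = re.compile(r"(a+)+b")

# No match exists, and the running time roughly doubles with each extra 'a'.
pattern.search("a" * 28 + "c")
```

Plain substring search has no comparable folklore, which is why a quadratic-time `str.find` surprises people in a way a slow regex does not.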