[Guido]
The key seems to be:
Except none of that quoted text (which I'll skip repeating) gives the slightest clue as to _why_ it may be an improvement. So you split the needle into two pieces. So what? What's the _point_? Why would someone even imagine that might help? Why is one half then searched right to left, but the other half left to right? There's actually "a reason" for searching the right half left to right, but because the shift on a mismatch in the left half is a constant ("per(x)") independent of the mismatching position, it's actually irrelevant in which order the left half's characters are compared (well, at least in some variations of these newer algorithms). Over-specification can be actively misleading too. What's the key idea? "Split the needle into two pieces" is a key _step_, not a key _insight_.
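For whatever it's worth, here's my own brute-force sketch of the two quantities the splitting step is really about: per(x), the needle's smallest period, and the "local period" at each possible split point. The real preprocessing finds a good split in linear time; this is purely for intuition, and the function names are mine:

    def per(x):
        """Smallest period p of x: x[i] == x[i+p] whenever both indices are valid."""
        m = len(x)
        for p in range(1, m + 1):
            if all(x[i] == x[i + p] for i in range(m - p)):
                return p
        return m

    def local_period(x, split):
        """Smallest r such that some length-r string straddles the split,
        agreeing with x on both sides wherever they overlap."""
        m = len(x)
        for r in range(1, m + 1):
            if all(x[i] == x[i + r]
                   for i in range(max(0, split - r), min(split, m - r))):
                return r
        return m

    x = "abaab"
    print("per(x) =", per(x))  # 3
    for split in range(1, len(x)):
        print(f"{x[:split]!r} | {x[split:]!r}  local period {local_period(x, split)}")
    # 'ab' | 'aab' has local period 3 == per(x): a "critical factorization".
    # The critical factorization theorem guarantees such a split exists with the
    # left piece shorter than per(x).  That's the split these algorithms search
    # around: the right piece is matched left to right, and a mismatch in the
    # left piece allows shifting the window by the constant per(x).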
I need a Python version though.
Dennis wrote one for his first stab, which you can download from the bug report (it's attached there as fastsearch.py). On the bug report's data, running that under CPython 3.9.0 on my box reduced the time from about 3 1/2 hours to about 20 seconds; running it under PyPy, to under one second (and the current C implementation, under 0.1 second).
I am not able to dream up any hard cases -- like other posters, my own use of substring search is usually looking for a short string in a relatively short piece of text. I doubt even the current optimizations matter to my uses.
They probably do, but not dramatically. Would you, e.g., notice a factor of 2 either way in those cases? A factor of 4? Of 10? When Fredrik first wrote this code, he routinely saw factors of 2 to 10 speedup on all sorts of string-searching benchmarks, both contrived and "natural". The speedups were especially pronounced for Unicode strings at the time, where skipping futile Unicode character comparisons could be more valuable than skipping (shorter) single-byte comparisons. I should also note that the fixed-string searching code is used as a subroutine by parts of our regexp implementation, by str.count(), str.replace(), similar `bytes` operations, and so on.
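If you're curious whether it matters for your own uses, it only takes a minute to time them; something along these lines, with your own typical needles and haystacks substituted in:

    # Quick sanity check for "typical" short-needle searches; numbers vary a lot
    # across builds and inputs, so substitute your own.
    import timeit

    text = "the quick brown fox jumps over the lazy dog " * 50
    for stmt in ('text.find("lazy dog")',
                 'text.count("o")',
                 'text.replace("fox", "cat")'):
        t = timeit.timeit(stmt, globals=globals(), number=100_000)
        print(f"{stmt:35} {t:.3f}s for 100_000 calls")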
What are typical hard cases used for?
It's kinda like asking what typical hard rounding cases for pow() are used for ;-) They aren't deliberate. They "just happen", typically when a pattern and the text to search contain self-similarities "but with glitches".

Search the string

    text = "X" * 10_000_000

for

    needle = "X" * 1_000_000

Easy! It matches at once.

But now tack "Y" on to the end of the needle. Then it's impossible to match. Brute force search first finds a prefix match at text[0 : 1000000], but then fails to match the trailing "Y". So brute force uses another million compares to match the prefix at text[1 : 1000001]. But again fails to match the trailing "Y". Then another million-plus-one compares to fail to match starting at index 2, and again at index 3, and again at index 4, ... it piles up trillions of character comparisons before it finally fails.

The current code starts comparing at the right end of each trial position first. Then an "X" from the text and the needle's trailing "Y" mismatch at once. That's a million times faster.

Here's a simple case where the current code is horribly slow, because neither "right-to-left" nor any of its preprocessing tricks do any good at all. Even Daniel Sunday's full-blown algorithm[1] (which the current code strips down _almost_ to the point of non-existence) wouldn't help (unless it used a variant I've never actually seen that compared the rarest pattern character first):

    ("X" * 10_000_000).find("X" * 500_000 + "Y" + "X" * 500_000)

The newer code returns -1 seemingly instantly.
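If it helps to see that blowup concretely, here's a scaled-down toy counter (my own throwaway script, not CPython's code) that tallies the character comparisons plain left-to-right brute force burns on shrunken versions of that input:

    def brute_force_compares(text, needle):
        """Count character comparisons done by naive left-to-right search."""
        n, m = len(text), len(needle)
        compares = 0
        for start in range(n - m + 1):
            for j in range(m):
                compares += 1
                if text[start + j] != needle[j]:
                    break
            else:
                return start, compares  # found a match
        return -1, compares

    for n, m in [(1_000, 100), (2_000, 200), (4_000, 400)]:
        _, c = brute_force_compares("X" * n, "X" * m + "Y")
        print(f"n={n:>5} m={m:>4} compares={c:>9}")
    # The count grows like n*m (it roughly quadruples each time both double);
    # extrapolating to n=10_000_000, m=1_000_000 gives on the order of 10**13,
    # while str.find() on the full-sized inputs returns -1 almost instantly.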
DNA search? (That would be cool!)
It can certainly help. Self-similarities are bound to be common in long strings from a 4-character alphabet (ACGT). But, to be fair, serious work with genomes typically invests in building a giant suffix tree (or enhanced suffix array) for each cataloged sequence to be searched. No preprocessing of needles is needed then to guarantee worst-case efficient searches of many kinds (including things like finding the longest adjacent repeated subsequences, more like a regexp (.+)\1 kind of search). But the new code could quite plausibly speed more casual prototyping of such explorations.
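To make the suffix-array idea concrete, here's a toy sketch of my own: naive construction, nothing like the engineered indexes real genome tools use, and the bisect key= trick needs Python 3.10+.

    import bisect

    def build_suffix_array(s):
        # O(n**2 log n) naive construction -- fine for a demo, not for genomes.
        return sorted(range(len(s)), key=lambda i: s[i:])

    def find_with_suffix_array(s, sa, needle):
        """Return one index where needle occurs in s, else -1 -- no needle preprocessing."""
        k = len(needle)
        lo = bisect.bisect_left(sa, needle, key=lambda i: s[i:i + k])
        if lo < len(sa) and s[sa[lo]:sa[lo] + k] == needle:
            return sa[lo]
        return -1

    genome = "ACGTACGTGACGATTACA"
    sa = build_suffix_array(genome)
    for needle in ("GATTACA", "TTT"):
        print(needle, find_with_suffix_array(genome, sa, needle), genome.find(needle))
    # Both columns agree: 11 for the hit, -1 for the miss.

[1] https://dl.acm.org/doi/abs/10.1145/79173.79184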