I'm sure that the large majority of the string searches I've done are in
Larry's tiny category.
However, I also think that big data needs are increasing, and things like
FASTA files can be enormously large texts that one has good reasons to
search on.
If there is a heuristic switch between algorithms, it seems like it should
threshold on needle, not on haystack. Or possibly some more complex
function of both. But if I understand the two-way approach correctly (which
I probably don't), there's not really much gain for small needle.
On Wed, Oct 14, 2020, 12:09 AM Larry Hastings
On 10/13/20 9:54 AM, Tim Peters wrote:
Due to the natures of the old and new algorithms, neither will be faster or slower in all cases, the newer one should never be dramatically slower than the old one, the new one will be spectacularly faster in some cases, and "in general" it seems impossible to guess in advance. It depends on the value the fancier preprocessing can extract from the pattern versus the higher preprocessing cost, and that in turn depends on details of exactly what the patterns and texts to be searched are.
My off-the-top-of-my-head reaction: I've always assumed that most string searching is tiny. Like, searching for a < 10 character string in a < 256 character string. That's what I'm always doing, at least. So while I'd obviously like to see us address the pathological cases cited--and thanks to Dennis Sweeney for the PRs!--I'd hope that it doesn't make these small searches any slower. Your email didn't give me a sense of how much additional preprocessing these new algorithms require; the fact that they're O(1) space suggests they aren't bad. But if you do lots and lots of dinky searches surely it would add up.
Looking at the PR, I see there's a special case for searching for a single character, and then cases for forward-search and reverse-search. I was wondering if I'd find a heuristic like "if the string you're searching in is < 2k, use the old search algorithm". Is that worth pursuing? It doesn't seem like the maintenance cost would be that high; I don't think anybody has touched that code in years. (No surprise, the Effbot did a good job.)
*/arry* _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/D5GYO2FV... Code of Conduct: http://python.org/psf/codeofconduct/