I'm sure that the large majority of the string searches I've done are in Larry's tiny category. 

However, I also think that big data needs are increasing, and things like FASTA files can be enormously large texts that one has good reasons to search on.

If there is a heuristic switch between algorithms, it seems like it should threshold on needle, not on haystack. Or possibly some more complex function of both. But if I understand the two-way approach correctly (which I probably don't), there's not really much gain for small needle.

On Wed, Oct 14, 2020, 12:09 AM Larry Hastings <larry@hastings.org> wrote:

On 10/13/20 9:54 AM, Tim Peters wrote:
Due to the natures of the old and new algorithms, neither will be
faster or slower in all cases, the newer one should never be
dramatically slower than the old one, the new one will be
spectacularly faster in some cases, and "in general" it seems
impossible to guess in advance. It depends on the value the fancier
preprocessing can extract from the pattern versus the higher
preprocessing cost, and that in turn depends on details of exactly
what the patterns and texts to be searched are.

My off-the-top-of-my-head reaction: I've always assumed that most string searching is tiny.  Like, searching for a < 10 character string in a < 256 character string.  That's what I'm always doing, at least.  So while I'd obviously like to see us address the pathological cases cited--and thanks to Dennis Sweeney for the PRs!--I'd hope that it doesn't make these small searches any slower.  Your email didn't give me a sense of how much additional preprocessing these new algorithms require; the fact that they're O(1) space suggests they aren't bad.  But if you do lots and lots of dinky searches surely it would add up.

Looking at the PR, I see there's a special case for searching for a single character, and then cases for forward-search and reverse-search.  I was wondering if I'd find a heuristic like "if the string you're searching in is < 2k, use the old search algorithm".  Is that worth pursuing?  It doesn't seem like the maintenance cost would be that high; I don't think anybody has touched that code in years.  (No surprise, the Effbot did a good job.)


Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/D5GYO2FVUW762RGZ5NMYKYM7PPWWRDIS/
Code of Conduct: http://python.org/psf/codeofconduct/