On Sep 21, 2019, at 04:35, Richard Musil wrote:
I wrote all this to show that without some insight it can sometimes be difficult or impossible to do it right (I was caught in several pitfalls myself on the original benchmarks I posted here), and also because it was actually fun to learn quite a few things from this thread. I am, however, not convinced that step-compare lists can compete with set operations (even when done on lists) in the current state of things.
One case I think you didn’t test is when the strings are generated in already-sorted order. In that case, as opposed to the case where you generate in random order and then sort, I think the PyUnicode objects and the actual character arrays will both be mostly contiguous (at least in a fresh process)—although I think the only way you could verify that is by using ctypes (or a C extension) to peek into the internal PyUnicode structs and look at the appropriate pointers. Anyway, the access pattern should be more than simple enough for the CPU to effectively cache-prefetch at all three levels, which could be a significant speedup for the sorted-list algorithms over the set algorithms. And this isn’t an entirely artificial scenario; sometimes large numbers of strings really are generated in sorted order and in large batches like this—e.g., the filesystem, or a network client, or whatever is handing you data in sorted order and you put it right into a list of strings. So, it might be worth testing.

But I think the real win for already-sorted data is laziness. With sets, at least one of your data sets has to be strict, so you can put it in a set. With sorted lists, with a minor rewrite to the algorithms (which I posted earlier for intersection), you can use them directly on iterators instead.

Imagine you’re reading 7GB worth of (already-sorted) strings and comparing them to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the set approach, the sorted-list approach, and even anything clever that pandas might do will all throw your machine into swap hell, which will probably swamp both the cost of using step-compare instead of sets and the cost of using iterators instead of lists. This one might be harder to benchmark (because the cost of the iterator is going to end up threaded through the cost of the algorithm), but it seems like it should be a huge win over sets.
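For reference, here is a minimal sketch of what such an iterator-based step-compare intersection could look like (not necessarily identical to the version posted earlier in the thread); it consumes two sorted iterables lazily and never materializes either side:

```python
def sorted_intersection(xs, ys):
    """Lazily yield elements common to two sorted iterables.

    Both inputs are consumed exactly once, in order, so they can be
    plain iterators (e.g., lines streaming off a file or a socket)
    rather than lists or sets held fully in memory.
    """
    it1, it2 = iter(xs), iter(ys)
    try:
        x, y = next(it1), next(it2)
        while True:
            if x < y:
                x = next(it1)   # advance the side that is behind
            elif y < x:
                y = next(it2)
            else:
                yield x         # match found; advance both sides
                x, y = next(it1), next(it2)
    except StopIteration:
        return                  # either input exhausted: done


# Usage: works on iterators, so neither input needs to fit in memory.
common = sorted_intersection(iter(["a", "b", "d"]),
                             iter(["b", "c", "d", "e"]))
print(list(common))
```

Union and difference follow the same step-compare skeleton, differing only in which branch yields.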