Set operations with Lists

I'm not sure if there is any interest from others, but I have frequently come across cases where I would like to compare the items in one list with those in another, similar to relational algebra. For example: are the file names in A also in B, and if so, return a new list with those items. Long story short, I wrote some functions to do that. They are quite simple and fast (due to timsort in no small part). Even the plain Python code is faster than the built-in set functions (afaik). I created a GitHub repository and put the ones I thought the community might like in there: https://github.com/ponderpanda/listex

An example would be:

a = [1,2,3,4,5,6]
b = [1,3,7,4]
list_intersection(a, b, sorta=True, sortb=True)  returns [1,3,4]

The complexity ends up being about the longer of list a or b. If they are over some thousands of elements, the speed difference compared with set() operations becomes significant. There are some details about naming, unique values (not necessary), and sort order that would probably need to be ironed out if they were to be included with the built-in list type. I'm not qualified to do that work but I'd be happy to help.

Best Regards,
Richard Higginbotham

On Thu, Sep 19, 2019 at 03:26:27PM +1000, Chris Angelico wrote:
Sets are unordered and can contain no duplicates. Lists are ordered and can contain duplicates, so you might think of them as "ordered multisets". I can't say I'm very enthusiastic about that "sorta, sortb" API, and I'm rather surprised about Richard's claim that a pure-Python list intersection is faster than set intersection written in C (but maybe Richard is a better programmer than me). But I'd like to hear more about this proposal. -- Steven

On Thu, Sep 19, 2019 at 10:32 PM Steven D'Aprano <steve@pearwood.info> wrote:
With the sorta/sortb API and the algorithmic assumption that the heads of both lists can be checked, I think the value of element order has been lost, so the only significance is the possibility of duplicates. The given use-case of file names doesn't justify this. So I'm curious what the OP's actual use-case was for using lists here. ChrisA

The value of the order is actually the critical part. If they are both ordered I can complete all comparisons in one pass over both lists at the same time. Let's think about "in A and not in B":

a = [1,2,3,6]
b = [1,4,5,6,7]

Start at a[0] and b[0]:
0 -- a[i] == b[j], so increment both indexes.
1 -- a[i] < b[j], so b cannot contain a[i] (due to sort order); append a[i] to the result and increment i.
2 -- a[i] < b[j], so b cannot contain a[i] (due to sort order); append a[i] to the result and increment i.
....
Every round of comparisons increments at least i or j, if not both. That makes the run time roughly len(a) + len(b).

The original use case was detecting new files in various directories. Each directory was expected to have millions of 4KB files and maybe at most 100k new files. This was around 2002 and it was a huge time savings for me. I'm not even sure that sets were available then. Originally it was a C API extension. Computers have gotten fast enough that the speed difference doesn't matter if you convert to and from sets, unless we are talking about millions of elements in a set. The original source was lost and the pure Python version has been adequate for anything I've needed in the last 10 years. The algorithms are actually very simple and short (maybe 200 lines total). I don't know if something like this would be useful for others, or it might be useful for a specific group like scipy with very large datasets. It might / probably? even be adapted to be used in the built-in set operations.
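For reference, a minimal sketch of the walk-through above (illustrative names only, not the code from the linked repo):

```
def list_difference(a, b):
    """Return the elements of sorted list a that are not in sorted list b.

    Both inputs are assumed to be pre-sorted; runs in roughly len(a) + len(b) steps.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # present in both: skip it
            i += 1
            j += 1
        elif a[i] < b[j]:         # b can no longer contain a[i]
            result.append(a[i])
            i += 1
        else:                     # b[j] < a[i]: advance b
            j += 1
    result.extend(a[i:])          # anything left in a cannot be in b
    return result

print(list_difference([1, 2, 3, 6], [1, 4, 5, 6, 7]))   # [2, 3]
```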

On Fri, Sep 20, 2019 at 5:11 AM Richard Higginbotham <higginbr@gmail.com> wrote:
The original use case was detecting new files in various directories. Each directory was expected to have millions of 4KB files and maybe at most 100k new files. This was around 2002 and it was a huge time savings for me.
I can't help thinking that the time for such a job would be dominated by disk operations here :) If a single directory has quote "millions" of files, you're going to see some pretty significant differences based on which file system you're using; not every FS copes well with gigantic directories. ChrisA

That's exactly what I expected. It's the opposite though. The time to check one list for all the elements in another list was far longer. Granted, this was a Linux box using ext2. The other helpful element is that the filesystem returned the files in the proper order. One of the test runs I mentioned was A 100000, B 500000 because that was a realistic scenario. If you actually try that with the naive method you will be waiting a very long time to get a result, even today. Using sets, if you know to use sets, then it's only a tenth of a second. I think both those methods are polynomial run time so I was pretty proud of my linear solution at the time.

On Fri, Sep 20, 2019 at 6:31 AM Richard Higginbotham <higginbr@gmail.com> wrote:
That's exactly what I expected. It's the opposite though. The time to check one list for all the elements in another list was far longer. Granted, this was a Linux box using ext2. The other helpful element is that the filesystem returned the files in the proper order. One of the test runs I mentioned was A 100000, B 500000 because that was a realistic scenario. If you actually try that with the naive method you will be waiting a very long time to get a result, even today. Using sets, if you know to use sets, then it's only a tenth of a second. I think both those methods are polynomial run time so I was pretty proud of my linear solution at the time.
Please quote some text to give context. WHAT is exactly what you expected? Building a set out of a series of values takes linear time, and set intersection also takes linear time. On what basis do you say that "both these methods" (again, not sure what you're referencing since you didn't quote any text) have polynomial run time? ChrisA

Chris Angelico wrote:
My recent guess was that set operations would be more than linear, as I expected the size of both lists to play a part in the time complexity. If nothing else, creating the hash tables would take additional time. Looking at the posts by Tim Peters, yourself, and others I've come to the conclusion that shouldn't be the case. That seems to hold true when A is small, but not as A increases. It doesn't explain the results, so I plan on looking into it further this weekend.

On Sep 19, 2019, at 12:09, Richard Higginbotham <higginbr@gmail.com> wrote:
It might / probably? even be adapted to be used in the built-in set operations.
How? I'm not sure if you get how sets work; maybe you think they're based on a red-black tree or skiplist or some other logarithmic collection, like C++ and Java sets? If that were true, then implementing set.intersection as a step-compare instead of just {x for x in self if x in y} would be an optimization—it would replace O(nlogm) time with O(n+m). (But then such a set would almost certainly already be using step-comparison—or, in fact, would already have been optimized with subtree pruning, which has the same worst case but a better best case—before anyone added it as a builtin.)

But sets actually use a hash table, just like dicts do. This means they work on objects that can't be compared (e.g., you can throw a bunch of complex numbers in a set). So you couldn't possibly use an algorithm that requires comparison to implement a set method, because it would raise a TypeError on perfectly valid sets.

This also means that lookup is an O(1) operation, not O(logn). So, intersection is O(m) time, which is better than your O(n+m), not worse. It also means that insert is amortized O(1) time, so you can build a set in O(n) time. So, if the collections aren't prepared in advance, building a set to intersect with an iterable is O(n+m) time, which is better than sorting two iterables to step-compare them in O(nlogn+mlogm+n+m) as you're proposing.
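A quick illustration of the hashable-but-not-orderable point:

```
a = [1 + 2j, 3 + 4j, 1 + 2j]
b = [3 + 4j, 5 + 6j]

# Sets work fine: complex numbers are hashable.
print(set(a) & set(b))          # {(3+4j)}

# A sort-based approach cannot even get started: complex numbers
# are not orderable, so sorting raises TypeError.
try:
    sorted(a)
except TypeError as exc:
    print("sort failed:", exc)
```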

It's faster than the built-in set type. Uniqueness of values isn't a requirement. The values need to be comparable so that they can be sorted. This allows a linear time complexity. That's basically the only restraint on the values. Here are some speed test results and pseudocode.

def naive(a, b):
    res = [x for x in a if x not in b]

def set_test(a, b):
    aset = set(a)
    bset = set(b)
    res = aset - bset
    res = list(res)

For A not in B: trying A 10000 elements B 50000 elements
naive test 10.285029 s
set test 0.008001 s
listex test 0.004000 s

For A subset of B: trying A 100000 elements B 500000 elements
naive: too long to wait :-)
set test 0.125013 s
listex test 0.054005 s

On Fri, Sep 20, 2019 at 5:08 AM Richard Higginbotham <higginbr@gmail.com> wrote:
It's faster than the built-in set type. Uniqueness of values isn't a requirement. The values need to be comparable so that they can be sorted. This allows a linear time complexity.
Note that the linear time complexity assumes that they have *already* been sorted (not just that they *can be* sorted), as otherwise the process will incur an O(n log n) sort step first. ChrisA

I meant near-linear time complexity. n log n is very, very close to n even when n is large. It's not constant; if it were, it could be dropped entirely... It's pretty close though.

On Sep 19, 2019, at 12:07, Richard Higginbotham <higginbr@gmail.com> wrote:
It's faster than the built-in set type.
Not according to timeit. If you want to dispute that, you need a proper benchmark, not just repeating the claims from your inaccurate one. I'm not sure what your relationtype is supposed to do (especially given that 1 and 2 produce the same values), and all versions seem like unusual use cases rather than anything approaching realistic, but I left that all alone and just replaced your benchmarks with timeit, repeated them enough times to actually be meaningful, and changed the test to the operation you claim it's about rather than a different one.

The output pretty clearly shows that a pre-existing set beats a pre-sorted list, and building a set on the fly beats sorting a list on the fly, and even comparing blatantly unfairly with building a set on the fly vs. using a pre-sorted list, the set still wins except in the smallest test. I didn't include tests against more realistic datasets where your algorithm does significantly worse; this shows that even your chosen input is slower with your algorithm if you measure it properly.

And, while we're at it, I added a version of your algorithm that works with arbitrary iterables and also takes a key function (which you always want when you're doing anything related to sorting in Python), and it's about 10-20% faster than your list version, which shows that you aren't getting any performance benefit from restricting yourself to lists only. See https://pastebin.com/BC2zgtN6 for the code.
Uniqueness of values isn't a requirement.
Of course. But non-unique values make the set algorithm faster, and they make your algorithm slower. And nearly-unique values don’t affect the set time, but they make your algorithm slower. So you can’t just ignore it. And, again, you have to decide what happens with non-unique values, how many times they appear in the output.
The values need to be comparable so that they can be sorted. This allows a linear time complexity.
Except that sorting is log-linear, not linear, so it doesn’t. Meanwhile, set actually _does_ allow linear time complexity. Again, none of this means your code is useless. I think a step-compare intersection is useful enough to belong in itertools, at least as a recipe. But it’s not faster than set, or simpler, or anything else you’re arguing.
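The pastebin above holds the actual benchmark code; purely as an illustration of the kind of iterable, key-aware step-compare intersection being described (a reconstruction under the assumption that both inputs are already sorted by the key, not the code from the pastebin):

```
def iter_intersection(a, b, key=lambda x: x):
    """Yield the items of iterable a whose key also appears in iterable b.

    Both iterables are assumed to be sorted by the same key function.
    """
    ib = iter(b)
    sentinel = object()
    y = next(ib, sentinel)
    for x in a:
        kx = key(x)
        # Advance b until its key catches up with a's key.
        while y is not sentinel and key(y) < kx:
            y = next(ib, sentinel)
        if y is sentinel:
            return
        if key(y) == kx:
            yield x

print(list(iter_intersection([1, 2, 3, 4, 5, 6], [1, 3, 4, 7])))   # [1, 3, 4]
```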

On Sep 18, 2019, at 12:29, Richard Higginbotham <higginbr@gmail.com> wrote:
I'm not sure if there is any interest by others but I have frequently come across cases where I would like to compare items in one list in another similar to relational algebra.
For anyone who doesn't want to read this whole reply: I think a step-compare function is useful often enough, and confuses novices often enough (if StackOverflow is any guide), that it might belong in the stdlib. But I think it should be in itertools, and should work on arbitrary iterables that are assumed to be pre-sorted (an assumption groupby already makes, and explains in the docs). And if it's not in more-itertools and/or toolz, maybe it's worth submitting it to them first and seeing how much people use it. Now on to why: To start with, itertools is already about providing an iterator algebra that isn't identical to relational algebra, but is a lot closer than what the list builtin provides.
For example: are the file names in A also in B, and if so, return a new list with those items. Long story short, I wrote some functions to do that. They are quite simple and fast (due to timsort in no small part).
Even the plain Python code is faster than the built-in set functions (afaik).
You're testing pretty small values in a very narrow use case, and testing them inaccurately by using datetime instead of timeit (the fact that some of your results are 0 vs. 0 should be a clue…), and you haven't even given us the results, just claimed that they're faster. When I run my own test on lists of 10000 strings made up of 5 random lowercase characters, I get 485us for set(a).intersection(b), and 6609us for list_intersection(a, b), so it's actually more than 10x slower. What if the values are already sorted, or already in sets? Then I get 175us for a&b, 3250us for list_intersection(a, b, False, False), so again it's more than 10x slower. Maybe it's because my set has a lot more duplicates. What if I do 50-character strings, and verify that there are no dups? Now it's 525us vs. 6699us, or 225us vs. 2910us for pre-prepared values—a bit closer, but still more than 10x slower. And meanwhile, what if I simulate long pathnames with a common prefix? Now I get 525us vs. 10200us.
How can that be true? Sorting a takes nlogn steps, sorting b takes mlogm steps, step-comparing then takes n+m steps, so that's O(nlogn + mlogm + n + m) = O(nlogn + mlogm), which is bigger than O(n). By comparison, making a set of b takes m steps, iterating over a and looking up each value in the hash takes n steps, so it's O(n+m) = O(n), which is smaller than O(nlogn).

It's possible that the constant multiplier for hash lookups is so much higher than the multiplier for comparing and nexting that you'd need to get into the zillions before the logn factor outweighs the multiplier, in which case you could argue that it's a good optimization despite having worse complexity. But you can't argue that it has better complexity. Also, if that lower constant really is critical to performance, surely it's not applicable to all values, as the long-pathname case demonstrates. Plus, given that your code is about 10x as slow even _without_ the sort, it doesn't seem to be true that the multiplier is better in the first place.

Also, forgetting about performance, sort-and-step doesn't work on values that aren't comparable, while set does. On the other hand, set doesn't work on values that aren't hashable, while sort-and-step does. Strings and integers are both, but lots of types are only one or the other (list, complex). For something general-purpose enough that you want to add it to builtins, that seems at least worth mentioning.
If they are over some thousands of elements, the speed difference compared with set() operations becomes significant. There are some details about naming, unique values (not necessary), and sort order that would probably need to be ironed out if they were to be included with the built-in list type.
Why would you want this as a method of list rather than a function that works just as well on any sortable iterables? The reason we have a sort method (as well as, not instead of, a sorted function) is to allow sorting in-place, which isn't relevant here. We don't have methods for zip, filter, etc.

Also, in most cases where I've wanted to intersect two lists of file names, I want to preserve the order of the first "master" list. You can't do that with sort-and-step except by doing something pretty complicated (e.g., decorating with enumerate, using a custom compare that ignores the indexes in your intersection, then re-sorting the result by index and undecorating), but it's trivial with sets:

    filterset = set(b)
    c = [val for val in a if val in filterset]

… or …

    c = filter(set(b).__contains__, a)

And this is still O(n+m) time, just like set(a)&set(b) would be, and with a faster constant because we're only setifying one of the two iterables (and often the smaller one). And this has the advantage that the potentially huge "master" input can be a lazy iterator. If a is a file listing the first names of each American, and b is a much smaller set of the legal names for French children, loading that whole file into memory to sort it would be a bad idea, and it's not necessary if you use sets.

Also, the last time I wanted to intersect two lists of file names was to diff two directories, so I needed a-b and b-a, not just a&b. Is there really a compelling use case for intersection that isn't also a compelling use case for at least some of the other set operations? In fact, you're arguing for just intersection, but then provided an implementation for some of the others. Which do you want, and why that particular set?

Also, if you care about duplicates (and if you don't care about either order or duplicates, why aren't you just using a set in the first place?), you need to specify whether this is multiset intersection (so if X appears 5 times in A and 2 times in B it shows up 2 times in A&B), or set intersection (so it shows up just once). The most obvious way of doing sort-and-step is going to do neither of those (it'll be 5, because it appears 5 times in A and each of those has a matching value in B); you could implement either pretty easily, but only if you decide which you want and then do it. With sets, you explicitly make the choice of set or Counter, and then whether you get set or multiset behavior is obvious; here, it's not obvious, and if you want it to be a choice you have to add another flag argument.
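To make the duplicate-handling choice concrete, here is the distinction in standard-library terms (a small sketch; the data is made up):

```
from collections import Counter

a = ["x"] * 5 + ["y"]
b = ["x"] * 2 + ["z"]

# Set intersection: "x" appears once, regardless of how often it occurs.
print(set(a) & set(b))                               # {'x'}

# Multiset intersection: "x" appears min(5, 2) == 2 times.
print(list((Counter(a) & Counter(b)).elements()))    # ['x', 'x']
```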

I'm sorry, I haven't used the mailing list before and didn't send my comments to the list. Please see my response to Chris for some example times. I'm reluctant to use anything less than about a ms for timing purposes. There are too many interactions at the us level that can skew results. I'd prefer to expand the sample size until it's more clear. Here is an example.

For A not in B: trying A 10000, B 50000
naive test 0:00:10.319032
set test 0:00:00.009001
listex test 0:00:00.005000

For A subset of B: trying A 100000, B 500000
set test 0:00:00.180018
listex test 0:00:00.054006

If I increase a and b by 10 times, the execution time goes up by about 100x on the set. This is because it's related to a * b. The listex only goes up by about 10x, which is about 10a + 10b.

trying A 10000, B 10000
naive test 0:00:02.661266
set test 0:00:00.002000
listex test 0:00:00.005001
c = [val for val in a if val in filterset]

This is the crux, I think. The complexity is the number of elements in filterset. Since filterset has to be checked for every value in a, it's O(len(a) * len(b) * filter search time). You wrote O(n+m) but I think that is incorrect.
The module (listex) I linked to implements the other operations as well. I didn't mean just intersection; I was just using that as an example. If you run the module it will generate some time comparisons for that, but it's just for comparison, not meant to be exhaustive. It's the set of functions I have found useful in the past. Not all of the ones in the module are ones I regularly use; some are there for purely academic purposes, actually.

On Fri, Sep 20, 2019 at 6:19 AM Richard Higginbotham <higginbr@gmail.com> wrote:
c = [val for val in a if val in filterset]

This is the crux, I think. The complexity is the number of elements in filterset. Since filterset has to be checked for every value in a, it's O(len(a) * len(b) * filter search time). You wrote O(n+m) but I think that is incorrect.
Yes, this IS the crux. If filterset were a Python list, you would be correct, but if (as the name suggests) it's a Python set, then checking "val in filterset" takes constant time, meaning that the entire process takes linear time. That is the entire point of using a hashtable for a set. ChrisA
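A quick way to see that constant-time membership for yourself (a sketch; absolute numbers vary by machine and Python version):

```
import timeit

for n in (10_000, 100_000, 1_000_000):
    data = list(range(n))
    as_set = set(data)
    probe = n - 1                      # worst case for the list scan
    t_list = timeit.timeit(lambda: probe in data, number=20)
    t_set = timeit.timeit(lambda: probe in as_set, number=20)
    # The list time grows roughly with n; the set time stays roughly flat.
    print(f"n={n:>9}: list {t_list:.4f}s  set {t_set:.6f}s")
```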

It's not really constant though. Look at all of the test results here. Richard Musil in the following post shows one value of A=1000, B=10000 with .034 ms, and the next point A=10000, B=50000. If the hash were constant time and A increased by 10x, then the time should be 0.34 ms, but it's not. It's 0.54 ms. The next step up is 8.5 ms instead of 0.54 ms. If you start including the time it takes to convert lists to sets it's even more pronounced. Hash values can collide, and the bigger the data set the more likely it is to happen. Perfect hash functions don't exist.

On Sep 19, 2019, at 19:18, Richard Higginbotham <higginbr@gmail.com> wrote:
It's not really constant though.
It’s really hard to have a discussion when all of your posts are all replies, but you don’t give us any clue what you’re replying to. There are multiple responses from multiple people since your last email, each with multiple paragraphs covering multiple subjects. So I have no idea what “it” refers to. Or what “though” is challenging. But I think what you’re trying to do is argue that hash tables don’t work.
hash values can collide and the bigger the data set the more likely it is to happen
No. The mean rate of collisions does not depend on the size of the data set, it depends on the load factor—the size of the data set divided by the size of the hash table. This means that, even if you don’t know the size of the needed hash table in advance (which we actually _do_ know here), just expanding geometrically (the same way lists expand) whenever you reach a triggering load factor is good enough to guarantee amortized constant time for all operations. The fact that it sounds implausible to you at first glance and you can’t understand one set of numbers does not overthrow decades of theoretical proofs and practical experience. Hash tables really do work.

You know, if anyone wanted a good example of why Computer Science courses should carry on teaching about sorting algorithms despite them being already implemented for you these days, this thread is it. Very nicely explained, folks! -- Rhodri James *-* Kynesim Ltd

Andrew Barnert wrote:
The number of values in A is the limiting factor for set operations. That's why I didn't use large values for A and small values for B. Past that the time it takes to make a set from a list is the other limiting factor. I'm not proposing to replace the set library. If the user has two sets use the set functions, they are plenty fast. If the user doesn't have sets then my list functions are faster than converting to sets and then back again, at least after the size of the lists has reached some threshold.

[Richard Higginbotham <higginbr@gmail.com>]
That's not so. Python's hash tables dynamically resize so that the load factor never exceeds 2/3. The worst case then is a "large" table as full as it's allowed to get before Python grows its size. In that case, the average number of probes for a successful search is 1.65, and for a failing search 3.00. It doesn't matter to this how large the tables get. For very small tables, the average values are lower, but 1.65/3.00 is good to the full 3 digits already at tables with 4096 slots (which can hold at most 2730 items). Of course there are pathological cases, but they're exceedingly unlikely to occur when using strings as keys, because Python's string hashes are a good approximation to (pseudo)random.
Perfect hash functions don't exist.
Of course they do, but it's impractically expensive to construct a perfect hash function for dynamically changing data. In Python, it so happens that when dict keys (or sets) consist of a contiguous range of integers (range(J, J+N) for some J and N), the integer hash function usually IS "perfect": no collisions at all. That's intentional and unusual, and follows in part from that hash(J) == J for "not huge" positive integers J.

Something I haven't seen mentioned here, but I may have missed it: when timing algorithms with sets and dicts, people often end up merely measuring the bad effects of hardware data cache misses when the containers get large, and especially in contrived benchmarks. In those the base data tends to get built up in a contiguous range of memory, and storing the objects in a list leaves traversing the list fetching the objects' data in the same contiguous memory order. It's as cache-friendly as can be. Traversing via a set or dict instead reads the objects' data in pseudo-random memory order instead, which can give a very high rate of cache misses. In addition, the internal hash table probes also occur in pseudo-random order, adding yet another layer of cache misses. Plain old lists are in fact pleasant in many ways :-)

On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote: [...]
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too. If it holds for real data, then presumably it counts as as advantage of lists over sets. But if it is an artifact of naive/contrived timing tests, is there something we can do to make the tests less naive? -- Steven

This is an interesting point, which is difficult to deduce without an implementation insight. I would for example assume that there is a "list construct" which holds just the references to the actual objects (strings in our example here), and the same for the dict (hashmap), so the actual objects could be anywhere, while the "list" or "dict" can be in one (memory) place. Did you mean that, or that also the objects themselves are in the "same place"? I believe there are two factors which may affect the performance: the cache miss on the actual data objects (e.g. the strings), and the cache miss on the "construct" object (i.e. the list or dict). But I have no idea how big lists or dicts (of the same size) are, or even if all those assumptions above are correct.

On Fri, Sep 20, 2019 at 10:43 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too.
The only thing which could impact it on the "real data" is the execution environment, i.e. for example, if the cores/threads are loaded and the operations get preempted by the scheduler (which would also invalidate the cache). Then, depending on the frequency of those invalidations, it could make the "fit into cache" condition less important. But it would depend on the system scheduler and the platform configuration (it may differ on a server and on a normal desktop machine). But technically the difference should not be bigger on the real data. Richard

On Sep 20, 2019, at 03:18, Richard Musil <risa2000x@gmail.com> wrote:
This is an interesting point, which is difficult to deduce without an implementation insight. I would for example assume that there is a "list construct" which holds just the references to the actual objects (strings in our example here) and the same for the dict (hashmap), so the actual objects could be anywhere, while the "list" or "dict" can be in one (memory) place. Did you mean that, or that also the object themselves are in the "same place"?
With a list of strings, it's actually even more than that. A CPython list is a (pointer to a) struct that holds, along with a few other things, a pointer to an array of PyObject* pointers. This array is obviously contiguous by definition. By comparison, a set is a (pointer to a) struct that holds a pointer to a hash table, which is an array of buckets in hash order, which isn't the order you're going to be accessing them in for the b set.

If the values are strings, the pointers point to PyUnicode structs, each of which holds… well, it's complicated, but in this case (where they're all pure ASCII and constructed out of normal Python operations) it's going to turn out to be a pointer to a char*, which is the array of actual characters. Also notice that string objects cache their hash in the struct, so for the set case, it only has to go to the char array at the very end to double-check equality; the operations before that only go one level down instead of two. Comparisons have to go all the way to the chars every time.

The PyUnicode structs get allocated by a special object allocator which is somewhat complicated. When you construct a bunch of objects in order in a fresh process they're likely to end up mostly in contiguous order, maybe with a few jumps every so often. This would not be as true in a long-lived process that's created and freed zillions of previous objects. And then the char arrays are allocated by a different allocator, but are also likely to be largely in contiguous order in this test, but not so much in a long-lived process.

In theory, because str is a known immutable type, Python is allowed to merge dups into the same object. But in practice, we're creating midsize strings out of __add__ and join, so I don't think CPython will ever do that here. (I don't know whether an implementation designed to sacrifice speed for space more, like MicroPython, might?)

Of course if you generate random strings and sort them, neither the objects nor the char arrays will be contiguous anymore, but the list's pointer array will be. My version of the test generates strings that happen to be in order and shuffles them so we can sort them. In that case, the sorting part won't have contiguous input in the objects and char arrays, but the input to the step-compare part will be close to contiguous again (although not exactly so, if there are any dups that could get shuffled out of order), so it's still giving the list code an unfair advantage that it wouldn't have in a real-life unsorted case.

[Tim]
[Steven D'Aprano <steve@pearwood.info>]
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too.
Depends on the data access patterns of the app. There's really no simple way to know. One possible clue: in recent CPythons, sys._debugmallocstats() can be called to get info about the state of Python's small object allocator. Compare the "# bytes in allocated blocks" output line to the total number of bytes across all arenas. If that ratio is close to 1.0 then at least we're _using_ most of arena memory. The closer to 0 it gets, the more data is splattered around, with large gaps between used bytes. Since CPython _never_ moves an object in memory, fragmentation can become nearly arbitrarily bad. As bad as using only 8 bytes out of each 256 KiB allocated (although I expect it would take effort to contrive a program in the same universe as being that bad).
What's the first rule of benchmarking? RUN ON A MACHINE AS QUIET AS POSSIBLE. So it's deliberately and spectacularly naive. Benchmark results should really be viewed as lower bounds: guarantees that no related real-life code will ever run faster, and almost always slower ;-) They remain the easiest way to guess whether one approach "is likely to" run faster than another, though.

An easy example of major cache effects:

xs = list(range(1000000))
sum(xs)

Then do the same, but use random.shuffle(xs) before summing. Just timing the `sum(xs)` part, a quick timeit run on my non-quiet box just now showed the latter taking well over 4x longer. The original spelling tends to access RAM (for the guts of the int objects) in sequential order, but shuffle first and sum() leaps all over RAM to get from one int to the next. The memory layout is the same either way, so staring at obmalloc stats won't yield any clue in that case.
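The experiment above, spelled out as a runnable sketch (results depend heavily on the machine and how quiet it is):

```
import random
import timeit

xs = list(range(1_000_000))
ordered = timeit.timeit(lambda: sum(xs), number=20)

random.shuffle(xs)
shuffled = timeit.timeit(lambda: sum(xs), number=20)

print(f"in allocation order: {ordered:.3f}s")
print(f"after shuffle:       {shuffled:.3f}s")   # typically several times slower
```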

Tim Peters wrote:
Very informative, thank you. I tried different table sizes and saw that you are correct that the table size doesn't matter. I assumed it was collisions that were adding the extra time when I increased the size of A and B. You are probably right that cache misses are a/the factor. In theory, for algebraic set operations the size of the left-hand argument (the one having its values "compared" to the other set) is the limiter. When I saw that time increase beyond the size increase of A, I thought it must be the hash functions, but I didn't really want to say that without knowing the internals. Actually now I wish it was, as it would be easier to account for in judging my algorithm :-).

On Sep 19, 2019, at 20:25, Tim Peters <tim.peters@gmail.com> wrote:
Another issue: the container itself is larger. A list’s array uses one pointer (8 bytes) per element. Even without looking at the implementation of set, a hash bucket obviously has to be at least twice that size. Plus, lists need less slack, and put their slack all at the end where you don’t need to traverse, while hash tables have their slack distributed within the table where it does get in the way. So, a list should be able to go at least 2.2x as big as a set before hitting the threshold where it’s no longer all in cache. But of course this is only true if you can forget about the objects themselves. Each string object is a lot more than 16 bytes, and the character strings themselves are as well in most tests, so the top-level difference is often swamped. Still, the nonlinear jumps when you cross boundaries like “everything in cache”, “at least the top level all in cache”, etc. (and at each level of cache) might be visible distortions (or visible reflections of a real benefit, if your application’s behavior is close enough to the artificial benchmark). You can test the behavior of list here by using a huge list of numbers from 1 to 100… but you can’t do that with set, because that would be a tiny set of 100 elements. And of course even once you know the exact characteristics of the cache behavior with your real data, that can be hard to optimize for beyond “try to keep adjacent things adjacent”. There’s a paper from the GHC guys many years ago about how they cleverly tuned the ropes (linked lists of mid-sized arrays) they used to optimize large string internals around the L2 cache and found that the code that worked great on their older Intel CPU was a huge pessimization on the next generation of CPUs because they were fooling the computer into thinking they were never going to read past the end of the current page so it didn’t prefetch the next one.
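The top-level size difference described above is easy to check with sys.getsizeof (a sketch; exact numbers are CPython-version dependent, and this ignores the element objects themselves):

```
import sys

n = 100_000
data = list(range(n))

# List: header plus one 8-byte pointer per element, about 0.8 MB here.
print(sys.getsizeof(data))

# Set: hash table with slack, typically several times larger than the list.
print(sys.getsizeof(set(data)))
```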

On Thu, Sep 19, 2019 at 10:18 PM Richard Higginbotham <higginbr@gmail.com> wrote:
Richard, I took your listex.py and modified it to use the timeit module for benchmarking and put it here (https://gist.github.com/risa2000/c44c1a06e89f348d82efcce5ec722774). The results I got from the modified code were:

For A not in B: trying A 1000, B 5000
best naive_relcomp test time (msec): 38.125
best relcomp_set test time (msec): 0.034
best relcomp_set with set init test time (msec): 0.218
best list_relcomp test time (msec): 0.211

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.545
best relcomp_set with set init test time (msec): 3.093
best list_relcomp test time (msec): 2.101

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 8.508
best relcomp_set with set init test time (msec): 55.001
best list_relcomp test time (msec): 29.242

Now, what is interesting about this is the benchmark I added: calculating the set difference while including the set setup (basically the set(list) call). This is the line with "best relcomp_set with set init". The results for those not familiar with Richard's listex.py are:
1) 'relcomp_set' is calculating set difference by using the subtraction operator (a - b) on two sets built from the two unsorted lists.
2) 'relcomp_set with set init' is the same function but with the set setup included, so basically measuring set(list_a) - set(list_b).
3) 'list_relcomp' is the function using Richard's implementation of the difference on the lists (again unsorted).

I did not verify the results (I hope they are correct), but the timing suggests that the set operation is always faster. When however we add the set setup to the mix, the build-up time, which is O(n + m), becomes more significant than actually sorting and comparing the lists, which is technically O(nlog(n) + mlog(m)). It would also suggest that for some large lists the list implementation _can_ actually be faster if we factor in an eventual conversion of the lists to sets, and possibly even faster if the lists were already ordered.

What I cannot explain is why the setup time becomes so significant only at A 10000, B 50000 and does not make much of a difference in case A 1000, B 5000. It would suggest that there is some other factor which has a different impact depending on the set size.

Richard M.

On Thu, Sep 19, 2019 at 11:58 PM Richard Musil <risa2000x@gmail.com> wrote:
Ok, I misread the original code, the lists were not sorted in the previous results (and they should be). So with the correction to have them sorted, the results are:

For A not in B: trying A 1000, B 5000
best naive_relcomp test time (msec): 38.147
best relcomp_set test time (msec): 0.035
best relcomp_set with set init test time (msec): 0.219
best list_relcomp test time (msec): 0.330

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.546
best relcomp_set with set init test time (msec): 3.293
best list_relcomp test time (msec): 3.493

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 8.651
best relcomp_set with set init test time (msec): 55.133
best list_relcomp test time (msec): 144.094

Which actually puts it in line with the expectations. I guess there is not much more to say.

Richard M.

On Sep 19, 2019, at 15:18, Richard Musil <risa2000x@gmail.com> wrote:
Ok, I misread the original code, the lists were not sorted in the previous results (and they should be). So with the correction to have them sorted,
I think to be fair, you want to show it _both_ ways, just as you're showing sets both with and without creating the sets. Because sometimes you're already going to have sorted lists, just as sometimes you're already going to have sets. So there are four things to compare:

* set operation on existing sets
* set operation including time to build the sets
* step-compare operation on pre-sorted lists
* step-compare including time to sort the lists

(Also, for the set operation, there's no need to build _two_ sets. Just build a set of one and intersect it with the other list.)

Anyway, from your two separate results, it looks like:

* If you have sets, set operations are faster.
* If you have unsorted lists, set operations are faster.
* If you have sorted lists, the times are a lot closer, and may vary in a nonobvious way, so if it matters, you probably want to profile with your actual data and the specific operation you need.

On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
Right. It was just giving wrong results and I was not sure if the algorithm on a "particularly unsorted" list would be representative (because it would produce the wrong results), so I removed it. I added a new test `list_relcomp[presorted]` with pre-sorted lists (just to have the results correct) to see if "making it correct" has any impact.

So there are four things to compare:
Right (again). When I was considering the algorithm on the lists, I thought that technically I could build just one set on the fly, but the other one has to be at least hashed to test the inclusion, and I also assumed that the hash operation will be dominant in this, and basically told myself that the cost of building the other set will not be significant. I just added another implementation to the test (which I believe you had in mind):

```
def relcomp_set_list(a, b):
    bset = set(b)
    return [ia for ia in a if ia not in bset]
```

and confirmed that avoiding the 2nd set build actually brings a noticeable improvement :). I also changed the naming, as the original code used `relcomp_set` to mark the set operation 'a - b' but returned the result as a list, which seemed to me a bit confusing, so I renamed it to `relcomp_set_to_list` (do not confuse it with `relcomp_set_list` mentioned above) and put there `relcomp_set`, which does a pure set 'a - b' and returns a set. So one can see the overhead of the resulting list construction in the original `relcomp_set`. The results are:

For A not in B: trying A 1000, B 5000
naive_relcomp (msec): 38.105
relcomp_set (msec): 0.025
relcomp_set_to_list (msec): 0.034
relcomp_set with set init (msec): 0.215
relcomp_set_list (msec): 0.192
list_relcomp (msec): 0.329
list_relcomp[presorted] (msec): 0.214

For A not in B: trying A 10000, B 50000
relcomp_set (msec): 0.416
relcomp_set_to_list (msec): 0.545
relcomp_set with set init (msec): 3.167
relcomp_set_list (msec): 2.546
list_relcomp (msec): 3.372
list_relcomp[presorted] (msec): 2.005

For A subset of B: trying A 100000, B 500000
relcomp_set (msec): 8.557
relcomp_set_to_list (msec): 8.519
relcomp_set with set init (msec): 55.783
relcomp_set_list (msec): 48.903
list_relcomp (msec): 143.577
list_relcomp[presorted] (msec): 120.911

There is still one peculiarity in the last test, where relcomp_set and relcomp_set_to_list gave basically the same result. I interpret it as the result being of a small size, basically insignificant compared to the other operations in play.

Richard M.

On Sep 20, 2019, at 02:41, Richard Musil <risa2000x@gmail.com> wrote:
Or just `set(b).intersection(a)` and maybe checking the len to pick which one. If you really want to go overboard with micro-optimizing this version, you might want to also check whether stashing bset.__contains__ and using that speeds things up (which may be different across recent versions of Python with changes to method calling?), and using it in filter instead of a comprehension (to move the loop fully into C, at the cost of needing to build a list out of an iterator at the end). But I don’t really think we need to go overboard with that. We already know that using a set is faster except possibly when you already have pre-sorted lists, and if you do have that case, I’m not sure pressing further with this artificial benchmark will tell you much anyway.
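For completeness, one such spelling applied to the a - b case from the benchmark might look like the sketch below (using filterfalse for the complement; whether stashing the bound method actually helps varies by Python version):

```
from itertools import filterfalse

def relcomp_filterfalse(a, b):
    # a - b as a list, preserving a's order; the filtering loop runs in C.
    bset = set(b)
    return list(filterfalse(bset.__contains__, a))

print(relcomp_filterfalse([1, 2, 3, 4], [2, 4]))   # [1, 3]
```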

Richard Musil wrote:
On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarnert@yahoo.com wrote:
On Sep 19, 2019, at 15:18, Richard Musil risa2000x@gmail.com wrote:
After some thought I don't think the time comparisons are useful unless we include the set init time. If we already have two sets, would anyone want to compare and convert that into a list? In the use case where we read a list of files, wouldn't we want to put that into a list instead of a set? I would think we would only put it into a set if we knew we would be doing set operations. At that point we would need to compare adding the elements individually to a list versus a set, to then compare the set operations with list operations fairly. At small data sizes we are probably just comparing the speed of Python with C. Perhaps I should write a C extension module to make sure it's an apples-to-apples comparison. I'll try that when I get some time.

On Sep 20, 2019, at 09:21, Richard Higginbotham <higginbr@gmail.com> wrote:
And? If I want to get a-b, a&b, and b-a, as I’d need to for a directory diff, I’m not going to do it by keeping them in lists and converting to sets three times. Just like I’m not going to keep them unsorted and sort them three times to do step-compare operations. So the cost of the operations themselves, separate from the setup time, clearly does matter.
At that point we would need to compare adding the elements individually to a list versus a set to then compare the set operations with list operations fairly.
Why would you be adding them individually? Surely, at least if you care this much about micro-optimization, you're going to be adding them in bulk. If you're reading a file full of file names, you'd just do list(f) or set(f). If you're reading 10 files full of file names, or getting them out of os.walk steps, you're going to call extend or union with each batch, not iterate one by one and add them.

Of course the actual details depend on the actual use case, and I'm not sure why you're trying to come up with a microbenchmark that exactly fits some imaginary and ill-defined use in the first place. Do you then need to go simulating a filesystem's readdir performance characteristics in some particular but arbitrary scenario? What's the goal here?

I think it's pretty easy to establish that sets are fast but they're not magic, sorting is pretty good but it's not magic, there are some cases where step-compare is faster if you have already-sorted lists… and none of this matters anyway, because in real life there are very few use cases where you can equally choose between sets and step-compare and need to micro-optimize, but lots of cases where you have _other_ reasons to choose between them. Sets work on non-comparable values; step-compare works on non-hashable values. Step-compare (if implemented right) works on lazy iterables; sets can preserve the order of the "master" input. They do different things with duplicates. Even when performance is the only issue (and it's not dominated by the file system anyway, as it would be in your case), the characteristics of your specific data are going to make a huge difference. Not to mention that there are often other compelling reasons to have a set or a sorted list, at which point the choice is made for you.

So, this already eliminates the possibility of reimplementing set operations with step-compare as suggested earlier. But it also already makes a compelling case for adding step-compare functions to Python—but that case needs to be made clearly. I think they'd fit very nicely in itertools, because they already provide an algebra on iterables, including functions like groupby that expect to be handed pre-sorted iterables, but maybe not everyone agrees. I don't know whether their usefulness (and stability) is so obvious that they don't need to go on PyPI to prove themselves. I also don't know whether they already exist on PyPI (maybe as part of more-itertools and/or toolz?). Also, I'm pretty sure there are questions on StackOverflow and elsewhere from people trying to do step-compare and getting it wrong, which could be another argument for having it in the stdlib, or at least the recipes in the docs, but only if someone digs up the questions. And there can be bikeshedding discussions about exactly which operations are needed, and what to call them. But further artificial micro benchmarks aren't going to help resolve any of those questions.

BTW, this thread seems a good place to mention the under-appreciated SortedContainers package from Grant Jenks: http://www.grantjenks.com/docs/sortedcontainers/

This supplies sorted containers (lists, dicts, sets) coded in pure Python, which generally run at least as fast as C extensions implementing fancy tree structures. The basic implementation "trick" is to maintain, under the covers, a list _of_ Python lists. Not deeper than that. If you're familiar with the lingo, it's most like a memory-resident 2-level B-tree. It scales extremely well, to billions of elements - to sizes the fancy packages can't get near, because their space overheads are so much higher they run out of RAM. There's extensive explanation of that here, which should be read to get a feel for just how much of modern processor behavior hasn't even been mentioned in _this_ thread ;-) http://www.grantjenks.com/docs/sortedcontainers/performance-scale.html

Here's a tease: """ For comparison, consider traversing an AVL-binary tree with one billion elements. A highly optimized implementation will require at least 24 gigabytes of memory. The binary tree will likely traverse thirty levels, each of which is a data-dependent lookup. Some lookups will have good locality but most will not. Each lookup could be hundreds to thousands of times slower than sequential accesses. These slow lookups are why Sorted Containers can afford to shift a thousand sequential elements in memory and have most additions take less time than binary tree implementations. """

It was very much designed with cache behavior in mind, where a hit in L1 cache can easily be 100 times faster than needing to go to RAM. That said, the package's SortedSet still uses a Python set under the covers, because he didn't find anything that could beat that. The sorted order is maintained, under the covers, in a parallel SortedList. For that reason, adding an element to (or removing one from) a SortedSet takes O(log(N)) time rather than O(1).
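Basic usage, for anyone who wants to try the package (a minimal sketch; requires `pip install sortedcontainers`):

```
from sortedcontainers import SortedList, SortedSet

sl = SortedList([5, 1, 3])
sl.add(2)                         # stays sorted on insertion
print(list(sl))                   # [1, 2, 3, 5]
print(list(sl.irange(2, 4)))      # [2, 3] -- iterate a value range in sorted order

ss = SortedSet("banana")          # set semantics, iterated in sorted order
print(list(ss))                   # ['a', 'b', 'n']
```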

Andrew Barnert wrote:
For A not in B: trying A 100000, B 500000
best relcomp_set test time (msec): 21.273
best relcomp_set with set init test time (msec): 143.083
best relcomp_set with set init intersect test time (msec): 79.064
best list_relcomp test time (msec): 35.025
best sort test time (msec): 9.160

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 282.933
best relcomp_set with set init test time (msec): 1795.730
best relcomp_set with set init intersect test time (msec): 1292.691
best list_relcomp test time (msec): 352.033
best sort test time (msec): 87.534

The first point shows my algorithm at about the same performance as a raw set op. So does the second. That's C code versus Python. The time for the whole operation, though, is on another level. While typing I got another:

For A not in B: trying A 1000000, B 50000000
best relcomp_set test time (msec): 293.465
best relcomp_set with set init test time (msec): 19223.592
best relcomp_set with set init intersect test time (msec): 10237.699
best list_relcomp test time (msec): 995.477
best sort test time (msec): 735.441

This just continues the trend.
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists, you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow"; they are waiting 20 seconds when we could deliver it to them in 1 second. If I can get within 10x of C code, that seems like it would be useful for that type of use case. I don't know if my case is common, and I've used it on more esoteric problems that I wouldn't expect anyone else to have (I'm used to dealing with a lot of data), but if it can be of help I came here to make it available.

On Sat, Sep 21, 2019 at 7:54 AM Richard Higginbotham <higginbr@gmail.com> wrote:
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists, you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow"; they are waiting 20 seconds when we could deliver it to them in 1 second. If I can get within 10x of C code, that seems like it would be useful for that type of use case. I don't know if my case is common, and I've used it on more esoteric problems that I wouldn't expect anyone else to have (I'm used to dealing with a lot of data), but if it can be of help I came here to make it available.
If you directly construct a set from your source of truth (eg from a directory iterator, if you're looking at file names), you don't have to "convert a list to a set". Even if you did have to convert (or technically, to construct a set with the elements of that list), how many elements do you need before it's actually going to take an appreciable amount of time? I don't mean a measurable amount of time - a handful of microseconds can be measured easily - but a length of time that would actually make people consider your program slow, which probably means half a second. Now multiply that number of elements by, say, 64 bytes apiece (15-character strings in Python 3.4+), and see if that would mean you have other issues than the actual hashing. I think you're still looking at a microoptimization here. ChrisA
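Building the set straight from the source of truth, as suggested, might look like this (a sketch using os.scandir on the current directory; two such sets can then be diffed directly):

```
import os

# Build the set directly from the directory iterator; no intermediate list.
names = {entry.name for entry in os.scandir(os.getcwd())}
print(len(names))
```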

On Sep 20, 2019, at 14:51, Richard Higginbotham <higginbr@gmail.com> wrote:
Andrew Barnert wrote:
What’s the goal here?
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists, you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow"; they are waiting 20 seconds when we could deliver it to them in 1 second.
But today, you can deliver it to them in 1 second: just give them `set(b).intersection(a)` or `set(a) & set(b)` or `sb = set(b)` then `[x for x in a if x in sb]` and you’re done. They can easily understand why it works. If they want to know why it’s faster, you can easily explain it, and they’ve learned something widely useful. If we add a new function, whether it’s in builtins or itertools or elsewhere, then you can give them that function and you’re done; that’s no easier or harder. They probably won’t immediately know how it works, but if they ask, they’ll learn something useful. And once they get why it works, they probably will get why it’s faster. So, overall, the new solution is roughly the same benefits as the existing one that’s been there since Python 2.3.
Or you write the step-compare manually, the way C programmers have done for decades. Or you find a third-party library or copy and paste code off a Stack Overflow answer. Converting to set is easier, and usually the right answer. The important question is, for those cases where it _isn’t_ the right answer, is “write it yourself or find it on SO” good enough?
No you’re not. You’re trying to make the case that step-compare is better than set in all uses. Which isn’t true. And it isn’t necessary in the first place. The case that needs to be made is that it’s good for some cases that we can’t handle today, and that at least one of those cases matters, and that the current state of affairs where people have to go figure out how to do it is not good enough.
I've never come across another step and compare.
People do this all over the place. For example look at the merge step of any merge sort: it’s a step-compare union, possibly in-place, possibly n-ary instead of binary, but recognizably the same thing. In C++, the stdlib has all of these algorithms, abstracted to work on any forward Iterators, in two versions that handle duplicates differently, plus in-place versions of some of them.

Andrew Barnert wrote:
This isn't technically correct. It's not faster. It all depends on the use case, which, when it contradicts your expectations, you just deride as "artificial micro benchmarks". Python isn't just used as a toy scripting language. 50 million elements in a collection is not even large by a long shot, at least from where I sit. You can make a case that it's not a good language for that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like people have been doing for decades". That you ask if that is a proper response to your users is very concerning to me.
So now all the scientists using Python also need to learn C, or at least be quick to copy and paste, or know the "secret" knowledge of using sets, except for their problems it may be much slower... so they'll also need to know all about Python's implementation of sets, cache misses, and those artificial micro benchmarks that give people a sad so we can't talk about them. I'm actually OK with that as an answer, as condescending and unprofessional as it may be.
If you ever find yourself arguing that you know what someone else is thinking better than they do, you should stop right there because you are probably wrong.
So someone can drop by in an attempt to help you, so long as they come on hands and knees and prove to you that you need their help. I don't think that can ever work. When you start off by saying their use case doesn't matter, it's clear you've already made up your mind. It's obvious you've taken this personally and would go through any hurdle to thwart a consideration of my proposal. That's OK, like I said it wasn't for me, it was for other people.
I thought we were talking about Python and algebraic set operations, and I said after my previous work I never needed to...
I did exactly what you just said earlier: I wrote a C extension for this. I suspect I could do it again, but as it would be an investment of time and effort, I thought I would try to get a consensus view. This process, and in particular this discussion, has convinced me that even if I want to go out of my way to help, this is not the community to be doing it in. There are some good people here who are also very technical. You should talk to them, because there is such a huge difference between, say, the response of Tim Peters and yourself that it's like night and day. That's all I have to say.

On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Richard, I can identify with the need to intersect large sets, and I am often faced with poor choices when it comes to dealing with them. I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow. I find it disappointing to convert an elegant expression declaring *what* I want, like "a-b", into code that declares *how* to do it, like "set(a).difference(b)". This disappointment is magnified by the fact that someone, somewhere, has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search for an elegant solution** on the internet.

Maybe we are looking for a different type of solution. You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database. A database may be able to deliver the speeds you require; it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is. I am suggesting that a Python container implementation, like sortedcontainers but using a database (maybe Sqlite), may be a solution. Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations, and made relatively insignificant. There is the delay of translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting.

Could you link to the repo where these tests are run? I seem to have lost the link in the long chain of emails on this list. I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all. Thank you.

** The sortedcontainers mentioned is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.
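For what it's worth, a minimal sketch of the Sqlite idea, using an in-memory database and SQL's EXCEPT for "A - B" (the function name and schema are made up for illustration; note that EXCEPT also deduplicates, which plain lists would not):

import sqlite3

def sqlite_difference(a, b):
    # A - B pushed into an in-memory SQLite database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (v TEXT)")
    con.execute("CREATE TABLE b (v TEXT)")
    con.executemany("INSERT INTO a VALUES (?)", ((x,) for x in a))
    con.executemany("INSERT INTO b VALUES (?)", ((x,) for x in b))
    # EXCEPT has set semantics, so duplicates in a collapse to one row.
    return [v for (v,) in con.execute("SELECT v FROM a EXCEPT SELECT v FROM b")]

print(sqlite_difference(["x", "y", "z"], ["y"]))  # ['x', 'z'] (row order not guaranteed)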

Kyle Lahnakoski wrote:
Hi Kyle, the repo is here: https://github.com/ponderpanda/listex

I'm still looking into sortedcontainers and Dominik's suggestion of numpy. Both seem promising. At the moment, I'm more concerned with in-memory operations. Disk access complicates things, but I think it would be interesting to include those. We've seen how memory access times can turn superior algorithms into inferior ones at scale. It's very likely that could be true of disk access times too. I'm not sure anyone would want to wait for that benchmark to finish, though. My suspicion is that a hybrid approach would be the most optimal. It gets more complicated when you consider that you could throw in more computers to increase in-memory sizes.

If you are going to use my routines for speed testing with large data sets, I would try simple timers instead of using the benchmark suite. It takes much longer to run the tests repeatedly, and the timing precision at scale is going to be drowned out by the long run times. I've been considering parallelization / distribution (sharing CPU and memory) and I think I have a pretty good idea of how to do it; I just haven't needed it for my work yet. I think set-style (hash table) operations would be less likely to benefit from parallelization, but I haven't looked into it at all yet. Depending on the size of your datasets, you might benefit more from that than from relational databases.

I think there is a problem here using timeit for this type of problem; look at the large amount of time used by the set init. Noticing this I changed my sorts to in-place and got better results. I assume there is some copying of locals or other overhead, but I can't explain the results. https://github.com/ponderpanda/listex/blob/master/listex_timit.py

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.039
best relcomp_set with set init test time (msec): 0.247
best list_relcomp test time (msec): 0.280

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.618
best relcomp_set with set init test time (msec): 4.156
best list_relcomp test time (msec): 2.890

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 12.664
best relcomp_set with set init test time (msec): 73.243
best list_relcomp test time (msec): 146.397

For A not in B: trying A 100, B 500
best naive_relcomp test time (msec): 0.428
best relcomp_set test time (msec): 0.004
best relcomp_set with set init test time (msec): 0.020
best list_relcomp test time (msec): 0.027

I still don't trust it, but it seems more reasonable. It also lines up with my previous results a little better.

On Sep 19, 2019, at 13:18, Richard Higginbotham <higginbr@gmail.com> wrote:
This is exactly what timeit is for. It repeats the tests multiple times and aggregates appropriately, and it does half a dozen other things that you haven’t even thought of and therefore aren’t even trying to correct for. And of course it’s been tested through decades of widespread use.
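For example, the kind of measurement being argued about here can be done with timeit.repeat in a few lines (the sizes and data below are arbitrary; the usual convention is to take the min of the repeats):

import random
import timeit

a = [random.randrange(10**6) for _ in range(10000)]
b = [random.randrange(10**6) for _ in range(50000)]

# Each repeat runs the statement `number` times; min() is the least-noisy estimate.
best = min(timeit.repeat("list(set(a).difference(b))",
                         globals=globals(), number=10, repeat=5))
print("best set difference time (msec): %.3f" % (best / 10 * 1000))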

It might be interesting to note that Numpy provides various set routines operating on their arrays (and hence lists as well, by conversion): https://docs.scipy.org/doc/numpy/reference/routines.set.html

For [intersection](https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html), for example, they do the following:
1. Concatenate the arrays,
2. Sort the result,
3. Compare subsequent elements for equality.

Most likely because for each of the steps, there is a C extension that provides an efficient implementation. For [membership testing](https://github.com/numpy/numpy/blob/d9b1e32cb8ef90d6b4a47853241db2a28146a57d...), i.e. checking which elements of `a` are in `b`, however, they use a condition where they decide that sometimes it's faster to use a loop over `a` and `b` instead of the concat+sort approach:

if len(b) < 10 * len(a) ** 0.145

Not sure where they got the exact numbers from (maybe from benchmarks?).
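The concatenate/sort/compare trick is easy to mimic directly; a rough sketch of that approach (close in spirit to what intersect1d does for unique inputs, though not a copy of numpy's actual code):

import numpy as np

def np_style_intersect(a, b):
    # Deduplicate each input, concatenate, sort, then keep values that
    # appear twice in a row: those are present in both inputs.
    aux = np.concatenate((np.unique(a), np.unique(b)))
    aux.sort()
    return aux[:-1][aux[1:] == aux[:-1]]

print(np_style_intersect([1, 2, 3, 4, 5, 6], [1, 3, 7, 4]))  # [1 3 4]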

Let me expand on the reasons for my post here. It seems like motivation has turned into a bit of a bike-shed. I have an algorithm I wanted to see if the Python devs would be interested in. It fulfills my use case as is. I've found it useful for different problem areas and I haven't seen anything similar, so I thought it might be useful to others. It's written in Python. It used to require containers with 100k or more elements to see a speed improvement over some other unnamed methods. These days it's closer to 50 million, with the only other contender being set (hash-based) operations, at least that I know of. I think there is a reasonable chance that if it were converted to C it would be comparable to, if not better than, the set operations for people who wanted lists for whatever reason, even with much smaller data sets.

One issue is that it is difficult to time-test because it has side effects and because, well, computers. It's also difficult to know what users would put into a list, so that we can determine its usefulness, either in relation to set operations or not. I'm used to thinking in lists, so I don't have good judgement on how useful it would be to other people. I'm most concerned with getting feedback from the list to determine if it would fit any other use cases, and whether there is some kind of need for it as a general function. If no one ever uses lists with more than 100k elements then it's not worth the time to look into. There's nothing wrong with that. I just thought that if I could give back something a little unique for my years of using Python it would be fun and worthwhile.

On Fri, Sep 20, 2019 at 11:44:17PM -0000, Richard Higginbotham wrote:
Let me expand on the reasons for my post here. It seems like motivation has turned into a bit of a bike-shed.
I don't think that anyone is questioning your motivation. (Although one of your recent replies to Andrew seems to be casting some pretty mean-spirited aspersions on *his* motivation: "[Andrew] would go through any hurdle to thwart a consideration of my proposal".) And honestly, your personal motivation isn't really important. If you are doing it for the fame and money, you'll probably be disappointed. If you are doing it to give back to the Python community, that's great, but the community and the core devs have no obligation to accept it into the standard library. You can always put it on PyPI; you don't need to ask anyone's permission, just go ahead and do it.

The core devs have a duty to curate the standard library. They cannot accept everything proposed, regardless of the author's feelings or motivations or how useful it is to them. They also have their own personal, selfish motivation: they are over-worked and under-appreciated as it is, without taking on the burden of *yet another* library to maintain if it isn't justified. Part of the process for justifying that library or set of algorithms is to run the gauntlet of Python-Ideas, and for good or ill, most of the people here tend to be pretty conservative in technology matters, possibly even more conservative than the core devs.

The people here on Python-Ideas are a subset of the Python community; if you can't convince us that your idea is good, you probably aren't going to get much interest from the rest of the community either. Don't take it personally, it's just the nature of the Python community. If you want a community more open to change, you might consider the Node.js or PHP communities, or so I'm led to believe. [...]
Am I misunderstanding you? You seem to be saying that your algorithm has become *slower*. In the past, your algorithm beat other unnamed algorithms for moderate sized lists of 100K elements, but now your algorithm doesn't beat them unless the lists are large, 50 million elements or so. So in relative terms, you seem to be saying that your algorithm is now 500 times slower relative to alternatives than it used to be. I don't think that's what you mean, so can you clarify?
Possibly. As we have seen from this thread, the benefits of locality can sometimes make up for a less sophisticated algorithm. If I've learned anything from this thread, it is that relying on Big Oh to predict performance is even less useful today than it used to be. [...]
You're on the wrong forum for those questions. Here, we are most concerned with whether or not the proposal is a good fit for the language and stdlib, and the person making the proposal is expected to have done their homework and determined other use-cases *first*, or at least as part of the process. If you want to maximise the number of people seeing your proposal, which will in turn increase your chances of getting feedback of potential use-cases, you might try posting it on somewhere like Reddit first. -- Steven

On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham <higginbr@gmail.com> wrote:
You are right, it is definitely not easy to benchmark, which I realized after I digested what Tim and Andrew wrote about the string representation, hashing and caching:
1) There are at least two levels of separate meta-data constructs before reaching the actual "const char*" for a string in a string list.
2) The hashes on strings are cached.
3) Sequentially created objects most likely end up in consecutive memory; when you shuffle them, everything gets messed up.

So I made a series of tests to prove some of my theories. I am using timeit (because I believe it is the best I can do), but I am trying to use it while taking into account the points mentioned above. The one consequence is that I run all timeit tests in only one iteration, to avoid any side effects from cached hashing (I will demonstrate it on an example). I chose relatively large datasets (10M string lists) where the execution time should be sufficiently long to justify the one-iteration rule (my machine is not stressed at all at the time I run the tests). I chose the set difference operation (A - B) as the benchmark, which your code implements in the `list_relcomp` function (I got the latest version of `listex_timit.py` from your repo).

First let's set up the environment:

In [2]: from os import urandom
   ...: from random import shuffle
   ...: from timeit import timeit
   ...: from listex_timit import list_relcomp

Now let's see the impact of the string hash caching:

In [3]: a = [urandom(8).hex() for i in range(10000000)]
In [4]: timeit('set(a)', number=1, globals=locals())
Out[4]: 1.5521358999999961
In [5]: timeit('set(a)', number=1, globals=locals())
Out[5]: 0.9837615

What happens here is that I create a list of random strings; the first timeit measures the set operation where all the strings have to be hashed, and the consecutive one measures only the build time for the set construct, while the hashing has already been done. The important point is that if I run the benchmark in several iterations, from the second iteration on I will only be benchmarking the set construction, but not the hashing, and with already 10 iterations it will basically eliminate the hashing time from the benchmark results. This may, or may not, actually correspond to the real use case scenario, so I am not saying it is a priori wrong, just that one has to pay attention to what he is benchmarking, and whether he is benchmarking what he thinks he is benchmarking.

The second important point is the memory access pattern and cache thrashing. Let's see what happens when I shuffle the data:

In [6]: shuffle(a)
In [7]: timeit('set(a)', number=1, globals=locals())
Out[7]: 1.4697333000000015

Now even with hashes already cached the time went up by 50% again, because of the cache thrashing. And if I start from fresh, without any cached hashes, and shuffle the data:

In [8]: a = [urandom(8).hex() for i in range(10000000)]
In [9]: shuffle(a)
In [10]: timeit('set(a)', number=1, globals=locals())
Out[10]: 2.033926100000002

Quite a difference from the "best case".
Let's see now how the cache thrashing impacts the lists (there is no hash caching on list ops, so the time is invariant to the iterations):

In [11]: a = [urandom(8).hex() for i in range(10000000)]
In [14]: timeit('list(a)', number=1, globals=locals())
Out[14]: 0.13943980000001943
In [15]: timeit('list(a)', number=1, globals=locals())
Out[15]: 0.13443960000003585
In [16]: shuffle(a)
In [17]: timeit('list(a)', number=1, globals=locals())
Out[17]: 0.25544789999997874
In [18]: timeit('list(a)', number=1, globals=locals())
Out[18]: 0.2587161000000151

With the combined knowledge from the previous tests, I can now run the actual benchmark for the 'A - B' set difference, using the set operations, and explain the results (I am converting it back to a list to make it equivalent to your `list_relcomp`):

In [19]: a = [urandom(8).hex() for i in range(10000000)]
In [20]: b = [urandom(8).hex() for i in range(10000000)]
In [21]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[21]: 3.8052587000000244
In [22]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[22]: 2.6856725000000097

We can clearly see the impact of the hash caching on the time (I do not include the following timeit executions, but they were giving basically the same time from the 2nd iteration on). On your function it runs like this:

In [23]: a = [urandom(8).hex() for i in range(10000000)]
In [24]: b = [urandom(8).hex() for i in range(10000000)]
In [25]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[25]: 18.45155139999997
In [26]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[26]: 7.87324250000006
In [27]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[27]: 7.88726209999993

Let's say the first times are comparable, as there was no data prep. The second times are also comparable: the set ops have cached the hashes, and you sorted your lists in place (which gives the significant improvement in the subsequent runs). This may also skew your benchmark results significantly if you run it in iterations, though (if the real use case does not preserve the sorting).
Finally, I add the same exercise with the data shuffled, which would possibly be the worst case scenario. First the set ops:

In [28]: a = [urandom(8).hex() for i in range(10000000)]
In [29]: b = [urandom(8).hex() for i in range(10000000)]
In [30]: shuffle(a)
In [31]: shuffle(b)
In [32]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[32]: 4.671694499999944
In [33]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[33]: 3.6129220999999916
In [34]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[34]: 3.424502699999948
In [35]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[35]: 3.4897670999999946

then your `list_relcomp` implementation:

In [36]: a = [urandom(8).hex() for i in range(10000000)]
In [37]: b = [urandom(8).hex() for i in range(10000000)]
In [38]: shuffle(b)
In [39]: shuffle(a)
In [40]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[40]: 24.619852700000024
In [41]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[41]: 7.913995600000021
In [42]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[42]: 8.124123400000371
In [43]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
Out[43]: 8.000760899999932

I wrote all this to show that without insight it might sometimes be difficult or impossible to do it right (I was caught myself in several pitfalls in my original benchmarks I posted here), and also because it was actually fun to learn quite a few things from this thread. I am however not convinced that step-compare lists can compete with set operations (even when done on lists) at the current state of things.

Richard M.

Richard Musil wrote:
When I first wrote the algorithms (2000 or so) I compared them to [x for x in a if x in b] and found that the latter was faster. It didn't make sense at first, but it was because the 'x in b' is done in C (I had just graduated and started using Python). If the data set was large enough I could see that the big-O growth was greater for the C version versus the Python version, similar to what we see here in the difference between the set operations and listex. I implemented listex in C and it was much faster at all data sizes over maybe 10 elements. I don't expect the same speedup compared to sets because they are more efficient. I also can't guarantee that the trend of set operations having higher time growth will hold true for all data sizes. I only point it out to say that for some use cases set relation operations aren't optimal, but they may be good enough.

Note that CPython's sort is in the business of merging sorted (sub)lists. Any mergesort is. But CPython's adapts to take advantage, when possible, of "lumpy" distributions. For example, if you sort

list(range(1000000, 2000000)) + list(range(0, 1000000))

it goes very fast (O(N)). Because it first rediscovers that the list is a catenation of two sorted sublists, then notices that because the smallest element of the left sublist is greater than the largest element of the right sublist, there's no need to compare anything else _between_ the sublists. The sorted result has to be the right sublist followed by the left one.

The same would work for a union operation if the sublists were given as distinct arguments. Similarly for intersection (which would deduce almost instantly that the intersection must be empty). And so on.

That's not by accident - the inspiration for CPython's sort's basic "galloping" approach was taken from this paper, which wasn't about sorting at all:

"Adaptive Set Intersections, Unions, and Differences" (2000), Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

That's concerned with manipulating large sets represented as sorted sequences. They noticed too that data was "lumpy" in real life, and that paper was the start of a number of papers trying ways to take advantage of that. So if someone _really_ wants to go gonzo with this, that's a promising approach.

Do note, though, that it doesn't fit naturally with iterators. "Galloping" relies on O(1) access to data at arbitrary index positions; it's a fancier variant of binary search (see listsort.txt in your CPython source distribution for details). If cache misses can be so expensive, how can that win? Because the number of index accesses per galloping step is logarithmic in the number of sequence elements consumed, and "compare" steps can (very roughly speaking) be skipped entirely for all the sequence elements _not_ touched by the galloping indices. Pointers to those are merely copied, without even looking at the data they point to. Which often turns out to be a huge win in CPython, because comparison in CPython is expensive.
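To make the idea concrete, here is a crude sketch of a sorted-list intersection that jumps ahead with binary search instead of advancing one element at a time. This only gives the flavor of the adaptive/galloping idea; it is not CPython's actual galloping code, which probes exponentially before bisecting:

from bisect import bisect_left

def skipping_intersection(a, b):
    # a and b must be sorted.  Whenever one side is behind, jump it forward
    # with a binary search rather than stepping element by element.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)  # skip everything in a below b[j]
        else:
            j = bisect_left(b, a[i], j + 1)  # skip everything in b below a[i]
    return result

print(skipping_intersection(list(range(1000000)), [3, 500000, 999999, 2000000]))
# [3, 500000, 999999]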

On Sep 22, 2019, at 15:28, Tim Peters <tim.peters@gmail.com> wrote:
I wonder if you could get any benefit by using something like a skiplist-rope hybrid, where you have effectively O(log n) random access rather than O(1), and can only copy huge chunks with O(log n) large memcopies instead of 1 huge one. Would those costs be enough to sink the benefits of galloping? It wouldn’t be usable in all cases, but there are many cases where your huge input is naturally iterable in this way (a set of files to concatenate, a huge file that you can seek in or mmap chunks of, maybe even a semi-lazy rope that you got from some Rust or Haskell library via ctypes). If you know the size of each iterator, you can build the first level of an approximate skiplist, and build the rest up on the fly to approximate binary search well enough (I think) for the purpose. Anyway, if I really did want to do these operations on massive datasets, I’d look at pandas+HDF5 to see what it can give me before considering implementing anything from scratch. But it could be a fun project…

A few further half-formed thoughts on this:

For a single huge file treated this way, you’d probably want to look at ripgrep. The author started off with the conventional wisdom on when to read vs. mmap vs. windowed-mmap and quickly discovered that all of the conventional wisdom was so badly out of date as to be useless…

Could you get any benefit out of parallelizing the gallop seekaheads? Probably not for 1-10 million, but we are talking about huge datasets here.

If you have an iterator extended with a nextn method that can iterate a large chunk of data at a time (which a file can do if you’re treating it as a sequence of bytes or characters, but if you’re treating it as a sequence of lines it can’t, at least without a second pass…), you could build the same thing—but the cost is that you might easily end up using temporary N/2 memory, which defeats most of the benefits of using arbitrary iterables.

C++, Swift, etc. don’t have a notion of a “bisectable iterator”, where like a bidirectional iterator it still takes O(m) time to jump forward or backward m steps, but it only takes O(1) time to jump (approximately?) halfway between two known positions. You could easily build that on top of anything random-access, and also on top of anything logarithmic like a tree or skiplist, and it seems like there are useful things you can do with it. Or maybe almost all such things are worth specializing differently for random-access vs. inherently-logarithmic data structures, or maybe it’s just too hard to come up with a proper useful abstraction?

Tim Peters wrote:
That's funny, as that was about the same time I was working on my list compare. I actually used a binary search as well. I divided the 'merged' list into three sections: a head, a body, and a tail. The head is where the values in one list are totally outside the other list on the 'low' (in sort order) side. The body was where the two lists started to overlap; I used the minimum of the two lists and a binary search to find that. The tail was the opposite end, so there was no need to use a binary search to find where it started (it's where one list ends). I was very surprised I could sort the list and step through it and still get a speed improvement. No matter how I changed the initial list state it was very fast. I expected it to be faster on the upper end with large lists, but not on the lower. It was my first time using sort and my second Python program, and I lucked out. Good times!

On Mon, Sep 23, 2019 at 12:21 AM Richard Higginbotham <higginbr@gmail.com> wrote:
I believe that your "suspicion" is correct and the set variant is executed mostly in C code (though I do not have any intimate knowledge of the Python implementation, so I may be wrong), while your code is executed in bytecode, and I also believe no one here claimed otherwise. This however is not a problem of `timeit`, whose main feature is that it is designed for benchmarking and available in the standard library. So anyone can use it, and the results are (somewhat) comparable. If you design your own benchmarking tool (I do not think datetime is a good candidate though), anything you gain you lose in reproducibility and comparability. Besides, saying "this benchmark does not give the results I expect" would probably also require explaining what you expect and why.
So far the benchmarks have shown that the list implementation was slower than the sets in every non-trivial case, sometimes by quite a margin. I am not saying that a C implementation cannot be faster; in fact, I will probably be eager to test it if you come up with one. But after what I read (and posted) in this thread I am also reluctant to accept that a C implementation _must_ be faster (or must be significantly faster). Maybe your code, even when run in Python, is already well optimized and won't get a significant speedup, or maybe it will be 5x faster; honestly, I do not dare even to guess. The only way to prove it is to actually implement it in C. Considering your use case, however, I wonder if you would not be better off going with the iterator approach (as Andrew has been hinting for some time in his posts). Richard M.

Richard Musil wrote:
The problem I have with timeit is actually just side effects. The same issue comes into play with sets if you are prebuilding them. If I use timeit to run my algorithm 10 times, the first execution takes the most time and it sorts the lists; later executions are faster, but I don't want those in the results. I'm looking for average-to-worst-case results. In the case of comparing C code with bytecode, we've already given set comparisons a huge boost by choosing simple hashable items and a C implementation. Trying to optimize it even further by using the fewest statements possible starts to just be silly. I only said my algorithm crossed a threshold of speed at some number of elements. That should be expected of every algorithm. It's not an indictment of the set library or Python. Lots of perfectly good algorithms don't scale well; mine obviously doesn't scale down very well.
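That side effect is exactly what timeit's setup argument can isolate: with number=1 the setup runs before every measured execution and is excluded from the timing, so each run gets a fresh, unsorted copy. A sketch, assuming a, b and list_relcomp from the repo are already defined:

import timeit

# setup re-copies the inputs before each of the 5 measured runs, so the
# in-place sort done by one run cannot speed up the next one; the copy
# itself is not included in the reported times.
times = timeit.repeat("list_relcomp(aa, bb, True, True)",
                      setup="aa = a.copy(); bb = b.copy()",
                      globals=globals(), number=1, repeat=5)
print("per-run times (sec):", times)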

On Sep 23, 2019, at 15:32, Richard Higginbotham <higginbr@gmail.com> wrote:
The code you posted doesn’t do any galloping, it just advances the index one at a time. The fact that you could do approximate binary searches to optimize it doesn’t help if you never do those searches. So it’ll be exactly as poor on unbalanced inputs as my iterator code. After all, they’re exactly the same algorithm (except for trivial differences in how they handle one input or the other ending early); the only difference is that one implementation calls next and remembers the value, the other increases an index and indexes a list.

Of course if you always have lists already in memory, the added overhead of using an iterator over the list is cost for no benefit. But that’s a fixed multiplier, it’s not different for balanced vs. unbalanced inputs. And of course if the iterator ever allows you to avoid building an otherwise-unnecessary list, it’s a win, but that win is also a flat cost that depends only on the allocation time for N elements, which also doesn’t vary based on balanced vs. unbalanced inputs. Anyway, if your use case really is a gigantic directory from a filesystem that guarantees sorted order and supports iterdir, iterators should help quite a bit.

Andrew Barnert wrote:
In the spirit of comradery, I'll try to follow Steven D'Aprano's community advice and be a little more blunt. I'll skip the insults though; I wrote them but I took them out. Honestly, I could have done this at any time, I just didn't want to hurt people's feelings. I already converted it to C a long time ago and did many, many tests I don't want to repeat. I know how it scales. We don't even have to convert to C. Let's just use pypy. For grins we'll change to an integer hash.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.130
best relcomp_set with set init intersect test time (msec): 0.115
best list_relcomp test time (msec): 0.012
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.117
best relcomp_set with set init test time (msec): 2.448
best relcomp_set with set init intersect test time (msec): 2.649
best list_relcomp test time (msec): 0.077
best sort test time (msec): 0.048

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Huh? At about 1000 elements they are equal, but after that my second Python program is 10x faster, oops. Let's try more iterations "with a real benchmark program". I won't repeat mine, though; we'll just take the first number that comes by.

trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.116
best relcomp_set with set init intersect test time (msec): 0.118
best list_relcomp test time (msec): 0.015
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.115
best relcomp_set with set init test time (msec): 2.729
best relcomp_set with set init intersect test time (msec): 2.780
best list_relcomp test time (msec): 0.111
best sort test time (msec): 0.049

You don't want to see data sets higher than that. OK, a little sarcasm, but you earned it. All you had to do was not be jerks.

On Sep 21, 2019, at 04:35, Richard Musil <risa2000x@gmail.com> wrote:
I wrote all this to show that without insight it might sometimes be difficult or impossible to do it right (I was caught myself in several pitfalls in my original benchmarks I posted here), and also because it was actually fun to learn quite a few things from this thread. I am however not convinced that step-compare lists can compete with set operations (even when done on lists) at the current state of things.
One case I think you didn’t test is when the strings are generated in already-sorted order. In that case, as opposed to the case where you generate in random order and then sort, I think the PyUnicode objects and the actual character arrays will both be mostly contiguous (at least in a fresh process)—although I think the only way you could verify that is by using ctypes (or a C extension) to peek into the internal PyUnicode structs and look at the appropriate pointers. Anyway, the access pattern should more than be simple enough for the CPU to effectively cache-prefetch at all three levels, which could be a significant speedup for the sorted-list algorithms over the set algorithms. And this isn’t entirely an artificial scenario; sometimes large numbers of strings really are generated in sorted order and in large batches like this—e.g., the filesystem, or a network client, or whatever is handing you data in sorted order and you put it right into a list of strings. So, it might be worth testing. But I think the real win for already-sorted data is laziness. With sets, at least one of your data sets has to be strict, so you can put it in a set. With sorted lists, with a minor rewrite to the algorithms (which I posted earlier for intersection), you can use them directly on iterators instead. Imagine you’re reading 7GB worth of (already-sorted) strings and comparing to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the set, sorted list, or even anything clever that pandas might do, they’ll all throw your machine into swap hell, which will probably swamp the cost of using step-compare instead of sets and of using Iterators instead of lists. This one might be harder to benchmark (because the cost of the Iterator is going to end up threaded through the cost of the algorithm), but it seems like it should be a huge win over sets.
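A minimal sketch of that kind of lazy step-compare, written as a generator over two pre-sorted iterables (the name is hypothetical and duplicates are handled naively); nothing is ever materialized, so it works as well on two huge sorted files as on two in-memory lists:

def sorted_difference(a, b):
    # Yield items of sorted iterable a that do not appear in sorted iterable b.
    sentinel = object()
    b_iter = iter(b)
    b_val = next(b_iter, sentinel)
    for a_val in a:
        while b_val is not sentinel and b_val < a_val:
            b_val = next(b_iter, sentinel)  # drop b values that are behind a_val
        if b_val is sentinel or a_val < b_val:
            yield a_val  # nothing equal to a_val is left in b

# e.g. lazily over two sorted files with one name per line:
# new = sorted_difference(open("today.txt"), open("yesterday.txt"))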

On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarnert@yahoo.com> wrote:
I did not benchmark it, because the original code from the OP has an explicit sorting built-in, so I simply concluded this was not the case. Nevertheless...
For the sake of completeness I did some benchmarks also with already sorted strings. I chose the following strings:

In [3]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
In [4]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]
In [5]: a[:5]
Out[5]: ['String#00000000', 'String#00000002', 'String#00000004', 'String#00000006', 'String#00000008']
In [6]: b[:5]
Out[6]: ['String#00000001', 'String#00000003', 'String#00000005', 'String#00000007', 'String#00000009']

First with the set ops:

In [7]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[7]: 3.9231181000000106
In [8]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[8]: 2.642592999999991
In [9]: timeit('list(set(a).difference(b))', number=1, globals=locals())
Out[9]: 2.648031700000004

This does not really differ from the previous results. Then with `list_relcomp` (note the 2x `False` in the argument list to disable the sorting):

In [10]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
In [11]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]
In [12]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[12]: 4.851956800000011
In [13]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
Out[13]: 4.8776169999999865

One interesting thing I noticed when running `list_relcomp` with a trivial input, e.g. (where a = c, but in different objects):

In [16]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
In [17]: c = [f'String#{i:08}' for i in range(0, 20000000, 2)]
In [18]: timeit('list_relcomp(a, c, False, False)', number=1, globals=locals())
Out[18]: 2.8997035999999525

It ends up running much faster, because it does not have to build the resulting list (the set implementation's performance does not change much in this case), while in the first setup everything in A goes into the result. So there are many details, but unless the use case can guarantee them, I am not sure they are worth the speculation.

But I think the real win for already-sorted data is laziness. With sets, at
If you move everything to the iterators and maybe even dilute the operations over some async I/O it may be pretty effective, but when talking about the algorithm alone, I am not sure even the pre-sorted lists have an advantage, at least not in this implementation, maybe when implemented in C? Richard M.

On Sep 22, 2019, at 12:50, Richard Musil <risa2000x@gmail.com> wrote:
On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarnert@yahoo.com> wrote:
One case I think you didn’t test is when the strings are generated in already-sorted order.
For the sake of completeness I did some benchmarks also with already sorted strings.
I’m a bit surprised that it didn’t make a huge difference—but only a bit; it really is hard to predict performance when you get down to this level of nitpicking allocation with this many layers of abstraction in the way.
But I think the real win for already-sorted data is laziness. With sets, at least one of your data sets has to be strict, so you can put it in a set. With sorted lists, with a minor rewrite to the algorithms (which I posted earlier for intersection), you can use them directly on iterators instead. Imagine you’re reading 7GB worth of (already-sorted) strings and comparing to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the set, sorted list, or even anything clever that pandas might do, they’ll all throw your machine into swap hell, which will probably swamp the cost of using step-compare instead of sets and of using Iterators instead of lists. This one might be harder to benchmark (because the cost of the Iterator is going to end up threaded through the cost of the algorithm), but it seems like it should be a huge win over sets.
If you move everything to the iterators and maybe even dilute the operations over some async I/O it may be pretty effective, but when talking about the algorithm alone, I am not sure even the pre-sorted lists have an advantage, at least not in this implementation, maybe when implemented in C?
You don’t really have to do anything to move the operations to the iterators. That’s the whole beauty of generator expressions, itertools, etc.: pile on as many Iterator transforms as you like, including things like union or difference, and the resulting Iterator automatically interleaves the work lazily on demand. Which works even if the inputs are coming in from some kind of external IO. (Making it work with push-model asyncio or similar frameworks does require rewriting things slightly, however.) I slapped together a complete (but barely tested, so you might want to verify correctness) implementation at https://github.com/abarnert/iterset if you want to benchmark it. It’s not at all optimized (in fact, there will usually be two extra generators in the path to each value, plus the overhead for checking and not calling a nonexistent key function, etc.), so I’m sure it’ll be slowed down by a considerable constant modifier for the in-memory case. But if you run it against two generators of strings, compared to the cost of building a set or two lists with the whole thing in memory, I wouldn’t be surprised if it overtakes them before you even get to swap hell or MemoryError. (Assuming you consume the result iteratively, of course.) And it will definitely beat them when you do get to that size. Of course if you really do need to intersect two massive sorted inputs that aren’t already in memory, I’ll bet there’s some kind of pandas/HDF way to do it that buffers up large chunks and optimizes the hell out of everything and blows my code away.

Let me go back to the top here. On Sep 18, 2019, at 12:29, Richard Higginbotham <higginbr@gmail.com> wrote:
I have frequently come across cases where I would like to compare items in one list in another similar to relational algebra.
I’ve put together an itertools-style implementation at https://github.com/abarnert/iterset if anyone’s interested. It includes merge, union, intersection, difference, symmetric_difference, issubset, and issuperset, all taking two sorted iterables and a key function.

While all of these things (except merge) can be done by just putting one iterable into a set (or Counter, if you need dups) and passing the other to a method, there are plenty of reasons to prefer these functions in some cases: preserving input order, taking key functions, working on non-hashable values, chaining up itertools-style with other iterator transforms, etc. Notice that the C++ standard library has exactly the same set of functions at the same level of flexibility (they take C++ iterator ranges, and an optional less-than function), despite having methods on (both unordered hash and ordered tree) sets, for much the same reasons. Also notice that itertools already has at least one function that expects to work on sorted iterables, groupby. And that in general itertools is intended to provide an “iterator algebra” that this fits into nicely.

So, I think they’re useful enough to exist somewhere, possibly in itertools, possibly in a third-party library (although maybe as part of more-itertools and/or toolz rather than on their own). It’s barely tested, not at all optimized, not fully documented, and generally not ready to turn into a patch to the stdlib, more-itertools, or toolz or into a PyPI package. But if anyone wants it to be polished up to that level, let me know and I could probably find time this week. (If not, I’ll just keep it in my toolbox, and maybe come back to it next time I need it.)

I haven’t tested performance at all. I expect it to be slower (when used with already-sorted data that’s already in a list) than Richard Higginbotham’s list implementation by a potentially hefty multiplier. From a quick&dirty test, I think the multiplier can be brought way down (albeit at significant cost to readability), but it’ll still be higher than 1. And that’s already slower than sets for at least most uses, as Richard Musil has demonstrated. Of course for huge data where reading everything into memory means swap hell, it’ll be a whole lot faster, but if you really need that, I’ll bet pandas/HDF5 has even better alternatives. Interleaving the work with the cost of reading or constructing the values might help performance in some cases, but that’s hard to generalize about. Overall, you probably want them when you need to preserve input order, or use a key function, etc., not when sets aren’t fast enough.

Hey Everybody, I do want to thank you for this souvenir. I'm having it blown up and mounted on the wall. I'm going to leave some space for the benchmarks that compare my module and his with the standard set implementation.

"I haven’t tested performance at all. I expect it to be slower (when used with already-sorted data that’s already in a list) than Richard Higginbotham’s list implementation by a potentially hefty multiplier. From a quick&dirty test, I think the multiplier can be brought way down (albeit at significant cost to readability), but it’ll still be higher than 1." -- Andrew Barnert, Python project.

See, Andrew, that's how you do sarcasm. None of that is a lie; I'm really gonna do it.

What if, instead, there was an `OrderedBag` class that acts like a `list` and supports set operations. The advantage of this is that any hidden indexing that is used to optimize the operations could be efficiently created in the object that results from the set operation during its construction while performing the operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).
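A very rough sketch of what that could look like for one operation, difference, with the "index" being a Counter that a real `OrderedBag` would presumably build once and carry along rather than rebuild per call (hashable elements assumed; everything here is illustrative, not a worked-out design):

from collections import Counter

class OrderedBag(list):
    def __sub__(self, other):
        # Multiset difference that preserves self's order: each occurrence
        # in `other` cancels one matching occurrence in self.
        remaining = Counter(other)
        result = OrderedBag()
        for item in self:
            if remaining[item] > 0:
                remaining[item] -= 1
            else:
                result.append(item)
        return result

print(OrderedBag([3, 1, 1, 2]) - [1, 2])  # [3, 1]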

On Oct 13, 2019, at 19:15, Steve Jorgensen <stevej@stevej.name> wrote:
What if, instead, there was an `OrderedBag` class that acts like a `list` and supports set operations.
The advantage of this is that any hidden indexing that is used to optimize the operations could be efficiently created in the object that results from the set operation during its construction while performing the operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).
This sounds doable, but nontrivial to design and implement, and with a lot of tradeoffs to consider. I’m not sure there is a one-size-fits-all implementation that would be good enough for most uses, but if there is, you probably need to find it and at least give us a proof of concept. For example, when I’ve needed to do the kind of thing you seem to be talking about, it’s often because I have a (multi)set of values with different orderings on it. Like a database table with multiple indices. If you don’t have any ordering except for a partial ordering on hash/equality, you probably wouldn’t want the same thing I do.

Steve Jorgensen wrote:
Let's look at the test results from earlier for the 5 million and 10 million item lists.

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Changing the order of set operations based on which set is largest gives almost a 10x speed improvement. Should set operations be changed? It's probably faster, no matter the order, to convert them into lists and use my algorithm after about 10 million. In the end, though, we are still talking about a few seconds at most. If it's worth it to you for that improvement that's one thing, but you should consider what you really need. Without clear performance/use-case criteria I think you'll never leave the bike-shed here.

On Fri, Sep 20, 2019 at 5:11 AM Richard Higginbotham <higginbr@gmail.com> wrote:
The original use case was detecting new files in various directories. Each directory was expected to have millions of 4KB files and maybe at most 100k of new files. This was in around 2002 and it was a huge time savings for me.
I can't help thinking that the time for such a job would be dominated by disk operations here :) If a single directory has quote "millions" of files, you're going to see some pretty significant differences based on which file system you're using; not every FS copes well with gigantic directories. ChrisA

That's exactly what I expected. It's the opposite though. The time to check one list for all the elements in another list was far longer. Granted this was a Linux box using ext2. The other helpful element is that the filesystem returned the files in the proper order. One of the test runs I mentioned was A 100000, B 500000 because that was a realistic scenario. If you actually try that with the naive method you will be waiting for a very long time to get a result, even today. Using sets, if you know to use sets, then it's only a 10th of a second. I think both those methods are polynomial run time, so I was pretty proud of my linear solution at the time.

On Fri, Sep 20, 2019 at 6:31 AM Richard Higginbotham <higginbr@gmail.com> wrote:
That's exactly what I expected. It's the opposite though. The time to check one list for all the elements in another list was far longer. Granted this was a Linux box using ext2. The other helpful element is that the filesystem returned the files in the proper order. One of the test runs I mentioned was A 100000, B 500000 because that was a realistic scenario. If you actually try that with the naive method you will be waiting for a very long time to get a result, even today. Using sets, if you know to use sets, then it's only a 10th of a second. I think both those methods are polynomial run time, so I was pretty proud of my linear solution at the time.
Please quote some text to give context. WHAT is exactly what you expected? Building a set out of a series of values takes linear time, and set intersection also takes linear time. On what basis do you say that "both these methods" (again, not sure what you're referencing since you didn't quote any text) have polynomial run time? ChrisA

Chris Angelico wrote:
My recent guess was that set operations would be worse than linear, as I expected the size of both lists to play a part in the time complexity. If nothing else, creating the hash tables would take additional time. Looking at the posts by Tim Peters, yourself, and others, I've come to the conclusion that shouldn't be the case. That seems to hold true when A is small, but not as A increases. It doesn't explain the results, so I plan on looking into it further this weekend.

On Sep 19, 2019, at 12:09, Richard Higginbotham <higginbr@gmail.com> wrote:
It might / probably? even be adapted to be used in the built in set operations.
How? I’m not sure if you get how sets work, and maybe you think they’re based on a red-black tree or skiplist or some other logarithmic collection, like C++ and Java sets? If that were true, then implementing set.intersection as a step-compare instead of just {x for x in self if x in y} would be an optimization—it would replace O(nlogm) time with O(n+m). (But then such a set would almost certainly already be using step-comparison—or, in fact, would already be optimized with subtree pruning, which has the same worst case but better best case—before anyone added it as a builtin.)

But sets actually use a hash table, just like dicts do. This means they work on objects that can’t be compared (e.g., you can throw a bunch of complex numbers in a set). So you couldn’t possibly use an algorithm that requires comparison to implement a set method, because it would raise a TypeError on perfectly valid sets.

This also means that lookup is an O(1) operation, not O(logn). So, intersection is O(m) time, which is better than your O(n+m), not worse. It also means that insert is amortized O(1) time, so you can build a set in O(n) time. So, if the collections aren’t prepared in advance, building a set to intersect with an iterable is O(n+m) time, which is better than sorting two iterables to step-compare them in O(nlogn+mlogm+n+m) as you’re proposing.
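The comparable-vs-hashable point is easy to see in the interpreter (just an illustration of the constraint, nothing more):

values = {1 + 2j, 3 - 4j, 5j}       # complex numbers: hashable, so sets are fine
print({1 + 2j} & values)            # {(1+2j)}

try:
    sorted(values)                  # ...but not orderable, so sort-and-step can't work
except TypeError as e:
    print("sorted() failed:", e)

try:
    {[1, 2], [3, 4]}                # lists are sortable but not hashable
except TypeError as e:
    print("set failed:", e)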

It's faster than the built-in set type. Uniqueness of values isn't a requirement. The values need to be comparable so that they can be sorted. This allows a linear time complexity. That's basically the only constraint on the values. Here are some speed test results and pseudocode.

def naive(a, b):
    res = [x for x in a if x not in b]

def set_test(a, b):
    aset = set(a)
    bset = set(b)
    res = aset - bset
    res = list(res)

For A not in B: trying A 10000 elements B 50000 elements
naive test 10.285029 s
set test 0.008001 s
listex test 0.004000 s

For A subset of B: trying A 100000 elements B 500000 elements
naive: too long to wait :-)
set test 0.125013 s
listex test 0.054005 s

On Fri, Sep 20, 2019 at 5:08 AM Richard Higginbotham <higginbr@gmail.com> wrote:
It's faster than the built-in set type. Uniqueness of values isn't a requirement. The values need to be comparable so that they can be sorted. This allows a linear time complexity.
Note that the linear time complexity assumes that they have *already* been sorted (not just that they *can be* sorted), as otherwise the process will incur an O(n log n) sort step first. ChrisA

I meant a near-linear time complexity. n log n is very, very close to n even when n is large. It's not constant; if it were, it could be dropped entirely... It's pretty close though.

On Sep 19, 2019, at 12:07, Richard Higginbotham <higginbr@gmail.com> wrote:
It's faster than the built-in set type.
Not according to timeit. If you want to dispute that, you need a proper benchmark, not just repeating the claims from your inaccurate one. I’m not sure what your relationtype is supposed to do (especially given that 1 and 2 produce the same values), and all versions seem like unusual use cases rather than anything approaching realistic, but I left that all alone and just replaced your benchmarks with timeit, repeated them enough times to actually be meaningful, and changed the test to the operation you claim it’s about rather than a different one.

The output pretty clearly shows that a pre-existing set beats a pre-sorted list, and building a set on the fly beats sorting a list on the fly, and even comparing blatantly unfairly with building a set on the fly vs. using a pre-sorted list, the set still wins except in the smallest test. I didn’t include tests against more realistic datasets where your algorithm does significantly worse; this shows that even your chosen input is slower with your algorithm if you measure it properly.

And, while we’re at it, I added a version of your algorithm that works with arbitrary iterables and also takes a key function (which you always want when you’re doing anything related to sorting in Python), and it’s about 10-20% faster than your list version, which shows that you aren’t getting any performance benefit from restricting yourself to lists only. See https://pastebin.com/BC2zgtN6 for the code.
Uniqueness of values isn't a requirement.
Of course. But non-unique values make the set algorithm faster, and they make your algorithm slower. And nearly-unique values don’t affect the set time, but they make your algorithm slower. So you can’t just ignore it. And, again, you have to decide what happens with non-unique values, how many times they appear in the output.
The values need to be comparable so that they can be sorted. This allows a linear time complexity.
Except that sorting is log-linear, not linear, so it doesn’t. Meanwhile, set actually _does_ allow linear time complexity. Again, none of this means your code is useless. I think a step-compare intersection is useful enough to belong in itertools, at least as a recipe. But it’s not faster than set, or simpler, or anything else you’re arguing.
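A sense of what such a recipe could look like: the same step-compare skeleton as the lazy difference sketched earlier, but for intersection over two pre-sorted iterables and with an optional key function (the name is illustrative only, not an existing itertools API):

def sorted_intersection(a, b, key=None):
    # Yield items of sorted iterable a whose key also appears in sorted iterable b.
    if key is None:
        key = lambda x: x
    sentinel = object()
    b_iter = iter(b)
    b_val = next(b_iter, sentinel)
    for a_val in a:
        while b_val is not sentinel and key(b_val) < key(a_val):
            b_val = next(b_iter, sentinel)
        if b_val is not sentinel and not key(a_val) < key(b_val):
            yield a_val  # keys are equal

print(list(sorted_intersection(["a.txt", "B.TXT", "c.txt"],
                               ["b.txt", "d.txt"], key=str.lower)))  # ['B.TXT']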

On Sep 18, 2019, at 12:29, Richard Higginbotham <higginbr@gmail.com> wrote:
I'm not sure if there is any interest by others but I have frequently come across cases where I would like to compare items in one list in another similar to relational algebra.
For anyone who doesn’t want to read this whole reply: I think a step-compare function is useful often enough, and confuses novices often enough (if StackOverflow is any guide), that it might belong in the stdlib. But I think it should be in itertools, and should work on arbitrary iterables that are assumed to be pre-sorted (an assumption groupby already makes, and explains in the docs). And if it’s not in more-itertools and/or toolz, maybe it’s worth submitting it to them first and seeing how much people use it.

Now on to why: to start with, itertools is already about providing an iterator algebra that isn’t identical to relational algebra, but is a lot closer than what the list builtin provides.
For example are the file names in A in B and if so return a new list with those items. Long story short, I wrote some functions to do that. They are quite simple and fast (due to timsort in no small part).
Even the plain python code is faster than the built in set functions (afaik).
You’re testing pretty small values in a very narrow use case, and testing them inaccurately by using datetime instead of timeit (the fact that some of your results are 0 vs. 0 should be a clue…), and you haven’t even given us the results, just claimed that they’re faster. When I run my own test on lists of 10000 strings made up of 5 random lowercase characters, I get 485us for set(a).intersection(b), and 6609us for list_intersection(a, b), so it’s actually more than 10x slower. What if the values are already sorted, or already in sets? Then I get 175us for a&b, 3250us for list_intersection(a, b, False, False), so again it’s more than 10x slower. Maybe it’s because my set has a lot more duplicates. What if I do 50-character strings, and verify that there are no dups? Now it’s 525us vs. 6699us, or 225us vs. 2910us for pre-prepared values—a bit closer, but still more than 10x slower. And meanwhile, what if I simulate long pathnames with a common prefix? Now I get 525us vs. 10200us.
How can that be true? Sorting a takes n log n steps, sorting b takes m log m steps, step-comparing then takes n+m steps, so that’s O(n log n + m log m + n + m) = O(n log n + m log m), which is bigger than O(n + m). By comparison, making a set of b takes m steps, iterating over a and looking up each value in the hash takes n steps, so it’s O(n + m), which is smaller than O(n log n + m log m). It’s possible that the constant multiplier for hash lookups is so much higher than the multiplier for comparing and nexting that you’d need to get into the zillions before the logn factor outweighs the multiplier, in which case you could argue that it’s a good optimization despite having worse complexity. But you can’t argue that it has better complexity. Also, if that lower constant really is critical to performance, surely it’s not applicable to all values, as the long-pathname case demonstrates. Plus, given that your code is about 10x as slow even _without_ the sort, it doesn’t seem to be true that the multiplier is better in the first place. Also, forgetting about performance, sort-and-step doesn’t work on values that aren’t comparable, while set does. On the other hand, set doesn’t work on values that aren’t hashable, while sort-and-step does. Strings and integers are both, but lots of types are only one or the other (list, complex). For something general-purpose enough that you want to add it to builtins, that seems at least worth mentioning.
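A two-line illustration of that hashable-vs-comparable point, for anyone following along:

```
sorted([[2, 1], [1, 2]])    # lists are comparable, so sorting works
# set([[2, 1], [1, 2]])     # ...but not hashable: TypeError: unhashable type: 'list'

{1 + 2j, 3 + 4j}            # complex numbers are hashable, so a set works
# sorted([1 + 2j, 3 + 4j])  # ...but not orderable: TypeError on '<'
```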
if they are over some thousands the speed difference with set() operations becomes significant. There are some details about naming, unique values (not necessary), and sort order that would probably need to be ironed out if they were to be included with the built in list type.
Why would you want this as a method of list rather than a function that works just as well on any sortable iterables? The reason we have a sort method (as well as, not instead of, a sorted function) is to allow sorting in-place, which isn’t relevant here. We don’t have methods for zip, filter, etc. Also, in most cases where I’ve wanted to intersect two lists of file names, I want to preserve the order of the first “master” list. You can’t do that with sort-and-step except by doing something pretty complicated (e.g., decorating with enumerate, using a custom compare that ignores the indexes in your intersection, then re-sorting the result by index and undecorating), but it’s trivial with sets: filterset = set(b) c = [val for val in a if val in filterset] … or … c = filter(set(b).__contains__, a) And this is still O(n+m) time, just like set(a)&set(b) would be, and with a faster constant because we’re only setifying one of the two iterables (and often the smaller one). And this has the advantage that the potentially huge “master” input can be a lazy Iterator. If a is a file listing the first names of each American, and b is a much smaller set of the legal names for French children, loading that whole file into memory to sort it would be a bad idea, and it’s not necessary if you use sets. Also, the last time I wanted to intersect two lists of file names was to diff two directories, so I needed a-b and b-a, not just a&b. Is there really a compelling use case for intersection that isn’t also a compelling use case for at least some of the other set operations? In fact, you’re arguing for just intersection, but then provided an implementation for some of the others. Which do you want, and why that particular set? Also, if you care about duplicates (and if you don’t care about either order or duplicates, why aren’t you just using a set in the first place?), you need to specify whether this is multiset intersection (so if X appears 5 times in A and 2 times in B it shows up 2 times in A&B), or set intersection (so it shows up just once). The most obvious way of doing sort-and-step is going to do neither of those (it’ll be 5, because it appears 5 times in A and each of those has a matching value in B); you could implement either pretty easily, but only if you decide which you want and then do it. With sets, you explicitly make the choice of set or Counter, and then whether you get set or multiset behavior is obvious; here, it’s not obvious, and if you want it to be a choice you have to add another flag argument.
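To make the duplicate-handling distinction at the end concrete, here is a small sketch of the two behaviours, using set and collections.Counter (the sample values are made up):

```
from collections import Counter

a = ["x", "x", "x", "x", "x", "y"]
b = ["x", "x", "z"]

print(set(a) & set(b))                              # {'x'}: set intersection, x appears once
print(list((Counter(a) & Counter(b)).elements()))   # ['x', 'x']: multiset intersection, min(5, 2)
```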

I'm sorry, I haven't used the mailing list before and didn't send my comments to the list. Please see my response to Chris for some example times. I'm reluctant to use anything less than about a ms for timing purposes. There are too many interactions at the us level that can skew results. I'd prefer to expand the sample size until it's more clear. Here is an example:

For A not in B: trying A 10000, B 50000
naive test 0:00:10.319032
set test 0:00:00.009001
listex test 0:00:00.005000

For A subset of B: trying A 100000, B 500000
set test 0:00:00.180018
listex test 0:00:00.054006

If I increase a and b by 10 times, the execution time goes up by about 100x on the set. This is because it's related to a * b. The listex only goes up by about 10x, which is about 10a + 10b.

trying A 10000, B 10000
naive test 0:00:02.661266
set test 0:00:00.002000
listex test 0:00:00.005001
c = [val for val in a if val in filterset]
This is the crux, I think. The complexity is the number of elements in filterset. Since filterset has to be checked for every value in a, it's O(len(a) * len(b) * filter search time). You wrote O(n+m) but I think that is incorrect.
The module (listex) I linked to implements the other operations as well. I didn't mean just intersection I was just using that as an example. If you run the module it will generate some time comparison for that, but its just for comparison not meant to be exhaustive. It's the set of functions I have found useful in the past. It's not all of the ones in the module I regularly use for purely academic purposes actually.

On Fri, Sep 20, 2019 at 6:19 AM Richard Higginbotham <higginbr@gmail.com> wrote:
c = [val for val in a if val in filterset]
This is the crux, I think. The complexity is the number of elements in filterset. Since filterset has to be checked for every value in a, it's O(len(a) * len(b) * filter search time). You wrote O(n+m) but I think that is incorrect.
Yes, this IS the crux. If filterset were a Python list, you would be correct, but if (as the name suggests) it's a Python set, then checking "val in filterset" takes constant time, meaning that the entire process takes linear time. That is the entire point of using a hashtable for a set. ChrisA
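A quick way to see that behaviour for yourself (a rough sketch; the absolute numbers will vary by machine, but the shape should not):

```
from timeit import timeit

for n in (10_000, 100_000, 1_000_000):
    data = list(range(n))
    s = set(data)
    # Worst case for the list: the probed value is at the very end.
    t_list = timeit("n - 1 in data", number=1000, globals={"n": n, "data": data})
    t_set = timeit("n - 1 in s", number=1000, globals={"n": n, "s": s})
    print(f"n={n}: list {t_list:.4f}s, set {t_set:.6f}s")
```

The list membership time should grow roughly linearly with n, while the set time stays roughly flat.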

It's not really constant though. Look at all of the test results here. Richard Musil in the following post shows one value of A=1000, B=10000 with .034 ms, and the next point is A=10000, B=50000. If the hash were constant time and A increased by 10x then the time should be 0.34 ms, but it's not. It's 0.54 ms. The next step up is 8.5 ms instead of 0.54 ms. If you start including the time it takes to convert lists to sets it's even more pronounced. Hash values can collide, and the bigger the data set the more likely that is to happen. Perfect hash functions don't exist.

On Sep 19, 2019, at 19:18, Richard Higginbotham <higginbr@gmail.com> wrote:
It's not really constant though.
It’s really hard to have a discussion when all of your posts are replies, but you don’t give us any clue what you’re replying to. There are multiple responses from multiple people since your last email, each with multiple paragraphs covering multiple subjects. So I have no idea what “it” refers to. Or what “though” is challenging. But I think what you’re trying to do is argue that hash tables don’t work.
hash values can collide and the bigger the data set the more likely it is to happen
No. The mean rate of collisions does not depend on the size of the data set, it depends on the load factor—the size of the data set divided by the size of the hash table. This means that, even if you don’t know the size of the needed hash table in advance (which we actually _do_ know here), just expanding geometrically (the same way lists expand) whenever you reach a triggering load factor is good enough to guarantee amortized constant time for all operations. The fact that it sounds implausible to you at first glance and you can’t understand one set of numbers does not overthrow decades of theoretical proofs and practical experience. Hash tables really do work.

You know, if anyone wanted a good example of why Computer Science courses should carry on teaching about sorting algorithms despite them being already implemented for you these days, this thread is it. Very nicely explained, folks! -- Rhodri James *-* Kynesim Ltd

Andrew Barnert wrote:
The number of values in A is the limiting factor for set operations. That's why I didn't use large values for A and small values for B. Past that the time it takes to make a set from a list is the other limiting factor. I'm not proposing to replace the set library. If the user has two sets use the set functions, they are plenty fast. If the user doesn't have sets then my list functions are faster than converting to sets and then back again, at least after the size of the lists has reached some threshold.

[Richard Higginbotham <higginbr@gmail.com>]
That's not so. Python's hash tables dynamically resize so that the load factor never exceeds 2/3. The worst case then is a "large" table as full as it's allowed to get before Python grows its size. In that case, the average number of probes for a successful search is 1.65, and for a failing search 3.00. It doesn't matter to this how large the tables get. For very small tables, the average values are lower, but 1.65/3.00 is good to the full 3 digits already at tables with 4096 slots (which can hold at most 2730 items). Of course there are pathological cases, but they're exceedingly unlikely to occur when using strings as keys, because Python's string hashes are a good approximation to (pseudo)random.
Perfect hash functions don't exist.
Of course they do, but it's impractically expensive to construct a perfect hash function for dynamically changing data. In Python, it so happens that when dict keys (or sets) consist of a contiguous range of integers (range(J, J+N) for some J and N), the integer hash function usually IS "perfect": no collisions at all. That's intentional and unusual, and follows in part from the fact that hash(J) == J for "not huge" positive integers J. Something I haven't seen mentioned here, but I may have missed it: when timing algorithms with sets and dicts, people often end up merely measuring the bad effects of hardware data cache misses when the containers get large, and especially in contrived benchmarks. In those, the base data tends to get built up in a contiguous range of memory, and storing the objects in a list leaves traversing the list fetching the objects' data in the same contiguous memory order. It's as cache-friendly as can be. Traversing via a set or dict instead reads the objects' data in pseudo-random memory order, which can give a very high rate of cache misses. In addition, the internal hash table probes also occur in pseudo-random order, adding yet another layer of cache misses. Plain old lists are in fact pleasant in many ways :-)
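The integer-hash point is easy to check interactively (a small illustration; this is CPython-specific behaviour):

```
print([hash(i) for i in range(10)])   # [0, 1, 2, ..., 9]: hash(J) == J for small positive ints
print(hash(10**100) == 10**100)       # False: huge ints get reduced modulo a large prime
```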

On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote: [...]
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too. If it holds for real data, then presumably it counts as as advantage of lists over sets. But if it is an artifact of naive/contrived timing tests, is there something we can do to make the tests less naive? -- Steven

This is an interesting point, which is difficult to deduce without an implementation insight. I would, for example, assume that there is a "list construct" which holds just the references to the actual objects (strings in our example here) and the same for the dict (hashmap), so the actual objects could be anywhere, while the "list" or "dict" can be in one (memory) place. Did you mean that, or that also the objects themselves are in the "same place"? I believe there are two factors which may affect the performance. The cache miss on actual data objects (e.g. the strings), and the cache miss on the "construct" object (i.e. the list or dict). But I have no idea how big lists or dicts (of the same size) are, or even if all those assumptions above are correct. On Fri, Sep 20, 2019 at 10:43 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too.
The only thing which could impact it on the "real data" is the execution environment, i.e. for example, if the cores/threads are loaded and the operations get preempted by the scheduler (which would also invalidate the cache). Then, depending on the frequency of those invalidations, it could make the "fit into cache" condition less important. But it would depend on the system scheduler and the platform configuration (it may differ on a server and on a normal desktop machine). But technically the difference should not be bigger on the real data. Richard

On Sep 20, 2019, at 03:18, Richard Musil <risa2000x@gmail.com> wrote:
This is an interesting point, which is difficult to deduce without an implementation insight. I would, for example, assume that there is a "list construct" which holds just the references to the actual objects (strings in our example here) and the same for the dict (hashmap), so the actual objects could be anywhere, while the "list" or "dict" can be in one (memory) place. Did you mean that, or that also the objects themselves are in the "same place"?
With a list of strings, it’s actually even more than that. A CPython list is a (pointer to a) struct that holds, along with a few other things, a pointer to an array of PyObject* pointers. This array is obviously contiguous by definition. By comparison, a set is a (pointer to a) struct that holds a pointer to a hash table, which is an array of buckets in hash order, which isn’t the order you’re going to be accessing them in for the b set. If the values are strings, the pointers point to PyUnicode structs, each of which holds… well, it’s complicated, but in this case (where they’re all pure ASCII and constructed out of normal Python operations) it’s going to turn out to be a pointer to a char*, which is the array of actual characters. Also notice that string objects cache their hash in the struct, so for the set case, it only has to go to the char array at the very end to double-check equality; the operations before that only go one level down instead of two. Comparisons have to go all the way to the chars every time. The PyUnicode structs get allocated by a special object allocator which is somewhat complicated. When you construct a bunch of objects in order in a fresh process they’re likely to end up mostly in contiguous order, maybe with a few jumps every so often. This would not be as true in a long-lived process that’s created and freed zillions of previous objects. And then the char arrays are allocated by a different allocator, but are also likely to be largely in contiguous order in this test, but not so much in a long-lived process. In theory, because str is a known immutable type, Python is allowed to merge dups into the same object. But in practice, we’re creating midsize strings out of __add__ and join, so I don’t think CPython will ever do that here. (I don’t know whether an implementation designed to sacrifice speed for space more, like MicroPython, might?) Of course if you generate random strings and sort them, neither the objects nor the char arrays will be contiguous anymore, but the list’s pointer array will be. My version of the test generates strings that happen to be in order and shuffles them so we can sort them. In that case, the sorting part won’t have contiguous input in the objects and char arrays, but the input to the step-compare part will be close to contiguous again (although not exactly so, if there are any dups that could get shuffled out of order), so it’s still giving the list code an unfair advantage that it wouldn’t have in a real life unsorted case.

[Tim]
[Steven D'Aprano <steve@pearwood.info>]
I'm not sure if this is something specific to timing tests, or if it will hold for real-world data in non-contrived code too.
Depends on the data access patterns of the app. There's really no simple way to know. One possible clue: in recent CPythons, sys._debugmallocstats() can be called to get info about the state of Python's small object allocator. Compare the "# bytes in allocated blocks" output line to the total number of bytes across all arenas. If that ratio is close to 1.0 then at least we're _using_ most of arena memory. The closer to 0 it gets, the more data is splattered around, with large gaps between used bytes. Since CPython _never_ moves an object in memory, fragmentation can become nearly arbitrarily bad. As bad as using only 8 bytes out of each 256 KiB allocated (although I expect it would take effort to contrive a program in the same universe as being that bad).
What's the first rule of benchmarking? RUN ON A MACHINE AS QUIET AS POSSIBLE. So it's deliberately and spectacularly naive. Benchmark results should really be viewed as lower bounds: guarantees that no related real-life code will ever run faster, and almost always slower ;-) They remain the easiest way to guess whether one approach "is likely to" run faster than another, though. An easy example of major cache effects:

xs = list(range(1000000))
sum(xs)

Then do the same, but use random.shuffle(xs) before summing. Just timing the `sum(xs)` part, a quick timeit run on my non-quiet box just now showed the latter taking well over 4x longer. The original spelling tends to access RAM (for the guts of the int objects) in sequential order, but shuffle first and sum() leaps all over RAM to get from one int to the next. The memory layout is the same either way, so staring at obmalloc stats won't yield any clue in that case.
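Spelled out as something you can paste into a shell (a sketch of the experiment Tim describes; the exact ratio depends on the machine and its cache sizes):

```
import random
from timeit import timeit

xs = list(range(1_000_000))
ordered = timeit("sum(xs)", number=100, globals={"xs": xs})

random.shuffle(xs)
shuffled = timeit("sum(xs)", number=100, globals={"xs": xs})

# Same objects, same memory layout; only the traversal order over RAM differs.
print(f"ordered: {ordered:.3f}s  shuffled: {shuffled:.3f}s")
```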

Tim Peters wrote:
Very informative, thank you. I tried different table sizes and saw that you are correct that the table size doesn't matter. I assumed it was collisions that were adding the extra time when I increased the size of A and B. You are probably right that cache misses are a/the factor. In theory, for algebraic set operations the size of the left-hand argument (the one having its values "compared" to the other set) is the limiter. When I saw that time increase beyond the size increase of A, I thought it must be the hash functions, but I didn't really want to say that without knowing the internals. Actually, now I wish it were, as it would be easier to account for in judging my algorithm :-).

On Sep 19, 2019, at 20:25, Tim Peters <tim.peters@gmail.com> wrote:
Another issue: the container itself is larger. A list’s array uses one pointer (8 bytes) per element. Even without looking at the implementation of set, a hash bucket obviously has to be at least twice that size. Plus, lists need less slack, and put their slack all at the end where you don’t need to traverse, while hash tables have their slack distributed within the table where it does get in the way. So, a list should be able to go at least 2.2x as big as a set before hitting the threshold where it’s no longer all in cache. But of course this is only true if you can forget about the objects themselves. Each string object is a lot more than 16 bytes, and the character strings themselves are as well in most tests, so the top-level difference is often swamped. Still, the nonlinear jumps when you cross boundaries like “everything in cache”, “at least the top level all in cache”, etc. (and at each level of cache) might be visible distortions (or visible reflections of a real benefit, if your application’s behavior is close enough to the artificial benchmark). You can test the behavior of list here by using a huge list of numbers from 1 to 100… but you can’t do that with set, because that would be a tiny set of 100 elements. And of course even once you know the exact characteristics of the cache behavior with your real data, that can be hard to optimize for beyond “try to keep adjacent things adjacent”. There’s a paper from the GHC guys many years ago about how they cleverly tuned the ropes (linked lists of mid-sized arrays) they used to optimize large string internals around the L2 cache and found that the code that worked great on their older Intel CPU was a huge pessimization on the next generation of CPUs because they were fooling the computer into thinking they were never going to read past the end of the current page so it didn’t prefetch the next one.
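The top-level size difference is easy to see with sys.getsizeof, which counts only the container itself and not the elements it refers to (a rough check; exact numbers vary by Python version and platform):

```
import sys

data = list(range(1_000_000))
print(sys.getsizeof(data))       # roughly 8 bytes per slot plus a small header
print(sys.getsizeof(set(data)))  # typically several times larger at the top level
```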

On Thu, Sep 19, 2019 at 10:18 PM Richard Higginbotham <higginbr@gmail.com> wrote:
Richard, I took your listex.py and modified it to use the timeit module for benchmarking and put it here ( https://gist.github.com/risa2000/c44c1a06e89f348d82efcce5ec722774). The results I got from the modified code were:

For A not in B: trying A 1000, B 5000
best naive_relcomp test time (msec): 38.125
best relcomp_set test time (msec): 0.034
best relcomp_set with set init test time (msec): 0.218
best list_relcomp test time (msec): 0.211

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.545
best relcomp_set with set init test time (msec): 3.093
best list_relcomp test time (msec): 2.101

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 8.508
best relcomp_set with set init test time (msec): 55.001
best list_relcomp test time (msec): 29.242

Now, what is interesting about this is the benchmark I added: calculating the set difference while including the set setup (basically the set(list) call). This is the line with "best relcomp_set with set init". The results for those not familiar with Richard's listex.py are:

1) 'relcomp_set' is calculating set difference by using the subtraction operator (a - b) on two sets built from the two unsorted lists.
2) 'relcomp_set with set init' is the same function but with the set setup included, so basically measuring set(list_a) - set(list_b).
3) 'list_relcomp' is the function using Richard's implementation of the difference on the lists (again unsorted).

I did not verify the results (I hope they are correct), but the timings suggest that the set operation is always faster. When, however, we add the set setup to the mix, the build-up time, which is O(n + m), becomes more significant than actually sorting and comparing the lists, which is technically O(n log n + m log m). It would also suggest that for some large lists the list implementation _can_ actually be faster if we factor in an eventual conversion of the lists to sets, and possibly even faster if the lists were already ordered. What I cannot explain is why the setup time becomes so significant only at A 10000, B 50000 and does not make much of a difference in case A 1000, B 5000. It would suggest that there is some other factor which has a different impact depending on the set size.

Richard M.

On Thu, Sep 19, 2019 at 11:58 PM Richard Musil <risa2000x@gmail.com> wrote:
Ok, I misread the original code, the lists were not sorted in the previous results (and they should be). So with the correction to have them sorted, the results are:

For A not in B: trying A 1000, B 5000
best naive_relcomp test time (msec): 38.147
best relcomp_set test time (msec): 0.035
best relcomp_set with set init test time (msec): 0.219
best list_relcomp test time (msec): 0.330

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.546
best relcomp_set with set init test time (msec): 3.293
best list_relcomp test time (msec): 3.493

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 8.651
best relcomp_set with set init test time (msec): 55.133
best list_relcomp test time (msec): 144.094

Which actually puts it in line with the expectations. I guess there is not much more to say.

Richard M.

On Sep 19, 2019, at 15:18, Richard Musil <risa2000x@gmail.com> wrote:
Ok, I misread the original code, the lists were not sorted in the previous results (and they should be). So with the correction to have them sorted,
I think to be fair, you want to show it _both_ ways, just as you’re showing sets both with and without creating the sets. Because sometimes you’re already going to have sorted lists, just as sometimes you’re already going to have sets. So there are four things to compare:

* set operation on existing sets
* set operation including time to build the sets
* step-compare operation on pre-sorted lists
* step-compare including time to sort the lists

(Also, for the set operation, there’s no need to build _two_ sets. Just build a set of one and intersect it with the other list.)

Anyway, from your two separate results, it looks like:

* If you have sets, set operations are faster.
* If you have unsorted lists, set operations are faster.
* If you have sorted lists, the times are a lot closer, and may vary in a nonobvious way, so if it matters, you probably want to profile with your actual data and the specific operation you need.

On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
Right. It just was giving wrong results and I was not sure if the algorithm on "particularly unsorted" list would be representative (because it would produce the wrong results), so I removed it. I added a new test `list_relcomp[presorted]` with pre-sorted (just to have the results correct) to see if "making it correct" has any impact. So there are four things to compare:
Right (again). When I was considering the algorithm on the lists, I thought that technically, I could build just one set on the fly, but the other one has to be at least hashed to test the inclusion, and I also assumed that the hash operation will be dominant in this and basically told myself that the cost of building the other set will not be significant. I just added another implementation to the test (which I believe you had in mind):

```
def relcomp_set_list(a, b):
    bset = set(b)
    return [ia for ia in a if ia not in bset]
```

and confirmed that avoiding the 2nd set build actually brings a noticeable improvement :). I also changed the naming as the original code used `relcomp_set` to mark the set operation 'a - b', but returned the results as a list, which seemed to me a bit confusing, so I renamed it to `relcomp_set_to_list` (do not confuse it with `relcomp_set_list` mentioned above) and put there `relcomp_set`, which does a pure set 'a - b' and returns a set. So one can see the overhead of the resulting list construction in the original `relcomp_set`. The results are:

For A not in B: trying A 1000, B 5000
naive_relcomp (msec): 38.105
relcomp_set (msec): 0.025
relcomp_set_to_list (msec): 0.034
relcomp_set with set init (msec): 0.215
relcomp_set_list (msec): 0.192
list_relcomp (msec): 0.329
list_relcomp[presorted] (msec): 0.214

For A not in B: trying A 10000, B 50000
relcomp_set (msec): 0.416
relcomp_set_to_list (msec): 0.545
relcomp_set with set init (msec): 3.167
relcomp_set_list (msec): 2.546
list_relcomp (msec): 3.372
list_relcomp[presorted] (msec): 2.005

For A subset of B: trying A 100000, B 500000
relcomp_set (msec): 8.557
relcomp_set_to_list (msec): 8.519
relcomp_set with set init (msec): 55.783
relcomp_set_list (msec): 48.903
list_relcomp (msec): 143.577
list_relcomp[presorted] (msec): 120.911

There is still one peculiarity in the last test, where relcomp_set and relcomp_set_to_list gave basically the same result. I interpret it as the result being of a small size, basically insignificant compared to the other operations in play.

Richard M.

On Sep 20, 2019, at 02:41, Richard Musil <risa2000x@gmail.com> wrote:
Or just `set(b).intersection(a)` and maybe checking the len to pick which one. If you really want to go overboard with micro-optimizing this version, you might want to also check whether stashing bset.__contains__ and using that speeds things up (which may be different across recent versions of Python with changes to method calling?), and using it in filter instead of a comprehension (to move the loop fully into C, at the cost of needing to build a list out of an iterator at the end). But I don’t really think we need to go overboard with that. We already know that using a set is faster except possibly when you already have pre-sorted lists, and if you do have that case, I’m not sure pressing further with this artificial benchmark will tell you much anyway.
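For reference, the micro-optimized variant being described would look something like this (a sketch, not a recommendation):

```
def intersect_keep_order(a, b):
    # Build one set, stash its bound __contains__ once, and let filter()
    # run the loop in C; list() collects the iterator at the end.
    contains = set(b).__contains__
    return list(filter(contains, a))
```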

Richard Musil wrote:
On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarnert@yahoo.com wrote:
On Sep 19, 2019, at 15:18, Richard Musil risa2000x@gmail.com wrote:
After some thought I don't think the time comparisons are useful unless we include the set init time. If we already have two sets, would anyone want to compare and convert that into a list? In the use case where we read a list of files, wouldn't we want to put that into a list instead of a set? I would think we would only put it into a list if we knew we would be doing set operations. At that point we would need to compare adding the elements individually to a list versus a set to then compare the set operations with list operations fairly. At small data sizes we are probably just comparing the speed of Python with C. Perhaps I should write a C extension module to make sure it's an apples-to-apples comparison. I'll try that when I get some time.

On Sep 20, 2019, at 09:21, Richard Higginbotham <higginbr@gmail.com> wrote:
And? If I want to get a-b, a&b, and b-a, as I’d need to for a directory diff, I’m not going to do it by keeping them in lists and converting to sets three times. Just like I’m not going to keep them unsorted and sort them three times to do step-compare operations. So the cost of the operations themselves, separate from the setup time, clearly does matter.
At that point we would need to compare adding the elements individually to a list versus a set to then compare the set operations with list operations fairly.
Why would you be adding them individually? Surely, at least if you care this much about micro-optimization, you’re going to be adding them in bulk. If you’re reading a file full of file names, you’d just do list(f) or set(f). If you’re reading 10 files full of file names, or getting them out of os.walk steps, you’re going to call extend or union with each batch, not iterate one by one and add them. Of course the actual details depend on the actual use case, and I’m not sure why you’re trying to come up with a microbenchmark that exactly fits some imaginary and ill-defined use in the first place. Do you then need to go simulating a filesystem’s readdir performance characteristics in some particular but arbitrary scenario? What’s the goal here? I think it’s pretty easy to establish that sets are fast but they’re not magic, sorting is pretty good but it’s not magic, there are some cases where step-compare is faster if you have already-sorted lists… and none of this matters anyway, because in real life there are very few use cases where you can equally choose between sets and step compare and need to micro-optimize, but lots of cases where you have _other_ reasons to choose between them. Sets work on non-comparable values; step-compare works on non-hashable values. Step-compare (if implemented right) works on lazy iterables; sets can preserve the order of the “master” input. They do different things with duplicates. Even when performance is the only issue (and it’s not dominated by the file system anyway, as it would be in your case), the characteristics of your specific data are going to make a huge difference. Not to mention that there are often other compelling reasons to have a set or a sorted list, at which point the choice is made for you. So, this already eliminates the possibility of reimplementing set operations with step-compare as suggested earlier. But it also already makes a compelling case for adding step-compare functions to Python—but that case needs to be made clearly. I think they’d fit very nicely in itertools, because they already provide an algebra on iterables, including functions like groupby that expect to be handed pre-sorted iterables, but maybe not everyone agrees. I don’t know whether their usefulness (and stability) is so obvious that they don’t need to go on PyPI to prove themselves. I also don’t know whether they already exist on PyPI (maybe as part of more-itertools and/or toolz?). Also, I’m pretty sure there are questions on StackOverflow and elsewhere from people trying to do step-compare and getting it wrong, which could be another argument for having it in the stdlib, or at least the recipes in the docs, but only if someone digs up the questions. And there can be bikeshedding discussions about exactly which operations are needed, and what to call them. But further artificial micro benchmarks aren’t going to help resolve any of those questions.

BTW, this thread seems a good place to mention the under-appreciated SortedContainers package from Grant Jenks: http://www.grantjenks.com/docs/sortedcontainers/ This supplies sorted containers (lists, dicts, sets) coded in pure Python, which generally run at least as fast as C extensions implementing fancy tree structures. The basic implementation "trick" is to maintain, under the covers, a list _of_ Python lists. Not deeper than that. If you're familiar with the lingo, it's most like a memory-resident 2-level B-tree. It scales extremely well, to billions of elements - to sizes the fancy packages can't get near, because their space overheads are so much higher they run out of RAM. There's extensive explanation of that here, which should be read to get a feel for just how much of modern processor behavior hasn't even been mentioned in _this_ thread ;-) http://www.grantjenks.com/docs/sortedcontainers/performance-scale.html Here's a tease: """ For comparison, consider traversing an AVL-binary tree with one billion elements. A highly optimized implementation will require at least 24 gigabytes of memory. The binary tree will likely traverse thirty levels, each of which is a data-dependent lookup. Some lookups will have good locality but most will not. Each lookup could be hundreds to thousands of times slower than sequential accesses. These slow lookups are why Sorted Containers can afford to shift a thousand sequential elements in memory and have most additions take less time than binary tree implementations. """ It was very much designed with cache behavior in mind, where a hit in L1 cache can easily be 100 times faster than needing to go to RAM. That said, the package's SortedSet still uses a Python set under the covers, because he didn't find anything that could beat that. The sorted order is maintained, under the covers, in a parallel SortedList. For that reason, adding an element to (or removing one from) a SortedSet takes O(log(N)) time rather than O(1).
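For anyone who wants to try it, the package's interface is close to the builtins (a small sketch, assuming `pip install sortedcontainers`):

```
from sortedcontainers import SortedSet

a = SortedSet(["c", "a", "b", "e"])
b = SortedSet(["b", "e", "f"])

print(list(a - b))   # ['a', 'c']: difference, kept in sorted order
print(list(a & b))   # ['b', 'e']: intersection
a.add("d")           # O(log n) rather than O(1), as noted above
```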

Andrew Barnert wrote:
For A not in B: trying A 100000, B 500000
best relcomp_set test time (msec): 21.273
best relcomp_set with set init test time (msec): 143.083
best relcomp_set with set init intersect test time (msec): 79.064
best list_relcomp test time (msec): 35.025
best sort test time (msec): 9.160

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 282.933
best relcomp_set with set init test time (msec): 1795.730
best relcomp_set with set init intersect test time (msec): 1292.691
best list_relcomp test time (msec): 352.033
best sort test time (msec): 87.534

The first point shows my algorithm at about the same performance as a raw set op. So does the second. That's C code versus Python. The time for the whole operation, though, is on another level. While typing I got another:

For A not in B: trying A 1000000, B 50000000
best relcomp_set test time (msec): 293.465
best relcomp_set with set init test time (msec): 19223.592
best relcomp_set with set init intersect test time (msec): 10237.699
best list_relcomp test time (msec): 995.477
best sort test time (msec): 735.441

This just continues the trend.
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow", they are waiting 20 seconds when we could deliver it to them in 1 second. If I can get within 10x of C code that seems like it would be useful for that type of use case. I don't know if my case is common and I've used it on more esoteric problems that I wouldn't expect anyone to do, I'm used to dealing with a lot of data, but if it can be of help I wanted to make it available.

On Sat, Sep 21, 2019 at 7:54 AM Richard Higginbotham <higginbr@gmail.com> wrote:
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow", they are waiting 20 seconds when we could deliver it to them in 1 second. If I can get within 10x of C code that seems like it would be useful for that type of use case. I don't know if my case is common and I've used it on more esoteric problems that I wouldn't expect anyone to do, I'm used to dealing with a lot of data, but if it can be of help I wanted to make it available.
If you directly construct a set from your source of truth (eg from a directory iterator, if you're looking at file names), you don't have to "convert a list to a set". Even if you did have to convert (or technically, to construct a set with the elements of that list), how many elements do you need before it's actually going to take an appreciable amount of time? I don't mean a measurable amount of time - a handful of microseconds can be measured easily - but a length of time that would actually make people consider your program slow, which probably means half a second. Now multiply that number of elements by, say, 64 bytes apiece (15-character strings in Python 3.4+), and see if that would mean you have other issues than the actual hashing. I think you're still looking at a microoptimization here. ChrisA
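In other words, for the file-name case the sets can be built straight from the directory scan, with no intermediate list (a sketch; the directory names are made up):

```
import os

names_a = {entry.name for entry in os.scandir("dir_a")}   # hypothetical directories
names_b = {entry.name for entry in os.scandir("dir_b")}

new_files = names_a - names_b   # in dir_a but not in dir_b
```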

On Sep 20, 2019, at 14:51, Richard Higginbotham <higginbr@gmail.com> wrote:
Andrew Barnert wrote:
What’s the goal here?
I'm not concerned about micro op, I'm concerned with macro op on large data sets. Right now if someone comes to you with that simple use case of two large lists you have to tell them to convert to sets and then back again. Forget the new programmers who tried the naive way and quit because "python is too slow", they are waiting 20 seconds when we could deliver it to them in 1 second.
But today, you can deliver it to them in 1 second: just give them `set(b).intersection(a)` or `set(a) & set(b)` or `sb = set(b)` then `[x for x in a if x in sb]` and you’re done. They can easily understand why it works. If they want to know why it’s faster, you can easily explain it, and they’ve learned something widely useful. If we add a new function, whether it’s in builtins or itertools or elsewhere, then you can give them that function and you’re done; that’s no easier or harder. They probably won’t immediately know how it works, but if they ask, they’ll learn something useful. And once they get why it works, they probably will get why it’s faster. So, overall, the new solution offers roughly the same benefits as the existing one that’s been there since Python 2.3.
Or you write the step-compare manually, the way C programmers have done for decades. Or you find a third-party library or copy and paste code off a Stack Overflow answer. Converting to set is easier, and usually the right answer. The important question is, for those cases where it _isn’t_ the right answer, is “write it yourself or find it on SO” good enough?
No you’re not. You’re trying to make the case that step-compare is better than set in all uses. Which isn’t true. And it isn’t necessary in the first place. The case that needs to be made is that it’s good for some cases that we can’t handle today, and that at least one of those cases matters, and that the current state of affairs where people have to go figure out how to do it is not good enough.
I've never come across another step and compare.
People do this all over the place. For example look at the merge step of any merge sort: it’s a step-compare union, possibly in-place, possibly n-ary instead of binary, but recognizably the same thing. In C++, the stdlib has all of these algorithms, abstracted to work on any forward Iterators, in two versions that handle duplicates differently, plus in-place versions of some of them.
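For comparison with the intersection recipe earlier in the thread, the merge step has the same shape with different yield rules; a sketch of a binary step-compare union over pre-sorted inputs (collapsing values that appear in both):

```
_DONE = object()   # sentinel so that None remains a legal element

def sorted_union(a, b):
    """Merge two pre-sorted iterables; values present in both are yielded once."""
    ita, itb = iter(a), iter(b)
    x, y = next(ita, _DONE), next(itb, _DONE)
    while x is not _DONE and y is not _DONE:
        if x == y:
            yield x
            x, y = next(ita, _DONE), next(itb, _DONE)
        elif x < y:
            yield x
            x = next(ita, _DONE)
        else:
            yield y
            y = next(itb, _DONE)
    while x is not _DONE:        # drain whichever input is left
        yield x
        x = next(ita, _DONE)
    while y is not _DONE:
        yield y
        y = next(itb, _DONE)

# list(sorted_union([1, 2, 4], [2, 3, 5])) -> [1, 2, 3, 4, 5]
```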

Andrew Barnert wrote:
This isn't technically correct. It's not faster. It all depends on the use case, which, when it contradicts your expectations, you just deride as "artificial micro benchmarks". Python isn't just used as a toy scripting language. 50 million elements in a collection is not even large by a long shot, at least from where I sit. You can make a case that it's not a good language for that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like people have been doing for decades". That you ask if that is a proper response to your users is very concerning to me.
So now all the scientists using Python also need to learn C, or at least be quick to copy and paste, or know the "secret" knowledge of using sets, except for their problems it may be much slower... so they'll also need to know all about Python's implementation of sets, cache misses, and those artificial micro benchmarks that give people a sad so we can't talk about them. I'm actually OK with that as an answer, as condescending and unprofessional as it may be.
If you ever find yourself arguing that you know what someone else is thinking better than they do, you should stop right there because you are probably wrong.
So someone can drop by in an attempt to help you so long as they come on hands and knees and prove to you that you need their help. I don't think that can ever work. When you start off by saying their use case doesn't matter, it's clear you've already made up your mind. It's obvious you've taken this personally and would go through any hurdle to thwart a consideration of my proposal. That's OK; like I said, it wasn't for me, it was for other people.
I thought we were talking about Python and algebraic set operations and I said after my previous work I never needed to..
I did exactly what you just said earlier: I wrote a C extension for this. I suspect I could do it again, but as it would be an investment of time and effort, I thought I would try to get a consensus view. This process and in particular this discussion has convinced me that even if I want to go out of my way to help, this is not the community to be doing it in. There are some good people here who are also very technical. You should talk to them, because there is such a huge difference between, say, the response of Tim Peters and yourself that it's like night and day. That's all I have to say.

On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Richard, I can identify with the need to intersect large sets; and I am often faced with poor choices when it comes to dealing with them. I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow. I find it disappointing to convert an elegant expression declaring *what* I want, like "a-b", into code that declares *how* to do it, like "set(b).intersection(a)". This disappointment is magnified by the fact that someone, somewhere has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search for an elegant solution** on the internet. Maybe we are looking for a different type of solution. You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database. A database may be able to deliver the speeds you require; it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is. I am suggesting a Python container implementation, like sortedcontainers but using a database (maybe Sqlite), may be a solution. Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations, and made relatively insignificant. There is the delay of translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting. Could you link to your repo where these tests are run? I seem to have lost the link in the long chain of emails on this list. I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all. Thank you.

** The sortedcontainers mentioned is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.
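A rough sketch of what the SQLite route could look like for the A - B case (the table and column names here are made up, and no claim is made about how it would perform):

```
import sqlite3

def sqlite_difference(a, b):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (v TEXT)")
    con.execute("CREATE TABLE b (v TEXT)")
    con.executemany("INSERT INTO a VALUES (?)", ((x,) for x in a))
    con.executemany("INSERT INTO b VALUES (?)", ((x,) for x in b))
    # EXCEPT behaves like a set difference (duplicates removed).
    rows = con.execute("SELECT v FROM a EXCEPT SELECT v FROM b").fetchall()
    con.close()
    return [r[0] for r in rows]

# sqlite_difference(["x", "y", "z"], ["y"]) -> ['x', 'z'] (order not guaranteed)
```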

Kyle Lahnakoski wrote:
Hi Kyle, The repo is here: https://github.com/ponderpanda/listex I'm still looking into sortedcontainers and Dominik's suggestion of numpy. Both seem promising. At the moment, I'm more concerned with in-memory operations. Disk access complicates things, but I think it would be interesting to include those. We've seen how memory access times can turn superior algorithms into inferior ones at scale. It's very likely that could be true with disk access times too. I'm not sure anyone would want to wait for that benchmark to finish though. My suspicion is that a hybrid approach would be the most optimal. It gets more complicated when you consider you could throw in more computers to increase in-memory sizes. If you are going to use my routines for speed testing with large data sets I would try simple timers instead of using the benchmark set. It takes much longer to run the tests repeatedly, and the timing precision at scale is going to be drowned out by the long run times. I've been considering parallelization / distribution (sharing CPU and memory) and I think I have a pretty good idea of how to do it. I just haven't needed it for my work yet. I think set style (hashtables) would be less likely to benefit from parallelization, but I haven't looked into it at all yet. Depending on the size of your datasets you might benefit more from that than from relational databases.

I think there is a problem here using timeit for this type of problem; look at the large amount of time taken by the set init. Noticing this I changed my sorts to in-place and got better results. I assume there is some copying of locals or other overhead but I can't explain the results. https://github.com/ponderpanda/listex/blob/master/listex_timit.py

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.039
best relcomp_set with set init test time (msec): 0.247
best list_relcomp test time (msec): 0.280

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.618
best relcomp_set with set init test time (msec): 4.156
best list_relcomp test time (msec): 2.890

For A subset of B: trying A 100000, B 500000
best relcomp_set test time (msec): 12.664
best relcomp_set with set init test time (msec): 73.243
best list_relcomp test time (msec): 146.397

For A not in B: trying A 100, B 500
best naive_relcomp test time (msec): 0.428
best relcomp_set test time (msec): 0.004
best relcomp_set with set init test time (msec): 0.020
best list_relcomp test time (msec): 0.027

I still don't trust it, but it seems more reasonable. It also lines up with my previous results a little better.

On Sep 19, 2019, at 13:18, Richard Higginbotham <higginbr@gmail.com> wrote:
This is exactly what timeit is for. It repeats the tests multiple times and aggregates appropriately, and it does half a dozen other things that you haven’t even thought of and therefore aren’t even trying to correct for. And of course it’s been tested through decades of widespread use.

It might be interesting to note that Numpy provides various set routines operating on their arrays (and hence lists as well, by conversion): https://docs.scipy.org/doc/numpy/reference/routines.set.html

For [intersection](https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html), for example, they do the following:

1. Concatenate the arrays,
2. Sort the result,
3. Compare subsequent elements for equality.

Most likely because for each of the steps, there is a C extension that provides an efficient implementation. For [membership testing](https://github.com/numpy/numpy/blob/d9b1e32cb8ef90d6b4a47853241db2a28146a57d...), i.e. checking which elements of `a` are in `b`, however, they use a condition where they decide that sometimes it's faster to use a loop over `a` and `b` instead of the concat+sort approach:

if len(b) < 10 * len(a) ** 0.145

Not sure where they got the exact numbers from (maybe from benchmarks?).
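For completeness, those NumPy calls look like this (a small example; assumes numpy is installed):

```
import numpy as np

a = np.array(["f1", "f2", "f3", "f4"])
b = np.array(["f2", "f4", "f9"])

print(np.intersect1d(a, b))   # ['f2' 'f4']: concat + sort + compare under the hood
print(np.isin(a, b))          # [False  True False  True]: elementwise membership test
```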

Let me expand on the reasons for my post here. It seems like motivation has turned into a bit of a bike-shed. I have an algorithm I wanted to see if the Python devs would be interested in. It fulfills my use case as is. I've found it useful for different problem areas and I haven't seen anything similar, so I thought it might be useful to others. It's written in Python. It used to require containers with 100k or more elements to see a speed improvement over some other unnamed methods. These days it's closer to 50 million, with the only other contender being set (hash-based) operations, at least that I know of. I think there is a reasonable chance that if it were converted to C it would be comparable to, if not better than, the set operations for people who wanted lists for whatever reason, even with much smaller data sets. One issue is that it is difficult to time test because it has side effects and because, well, computers. It's also difficult to know what users would put into a list so that we can determine its usefulness, either in relation to set operations or not. I'm used to thinking in lists so I don't have good judgement on how useful it would be to other people. I'm most concerned with getting feedback from the list to determine if it would fit any other use cases, and whether there is some kind of need for it as a general function. If no one ever uses lists with more than 100k elements then it's not worth the time to look into. There's nothing wrong with that. I just thought that if I could give back something a little unique for my years of using Python it would be fun and worthwhile.

On Fri, Sep 20, 2019 at 11:44:17PM -0000, Richard Higginbotham wrote:
Let me expand on the reasons for my post here. It seems like motivation has turned into a bit of a bike-shed.
I don't think that anyone is questioning your motivation. (Although one of your recent replies to Andrew seems to be casting some pretty mean-spirited aspersions on *his* motivation: "[Andrew] would go through any hurdle to thwart a consideration of my proposal".) And honestly, your personal motivation isn't really important. If you are doing it for the fame and money, you'll probably be disappointed. If you are doing it to give back to the Python community, that's great, but the community and the core devs have no obligation to accept it into the standard library. You can always put it on PyPI, you don't need to ask anyone's permission, just go ahead and do it. The core devs have a duty to curate the standard library. They cannot accept everything proposed, regardless of the author's feelings or motivations or how useful it is to them. There's also their own personal, selfish motivation: they are over-worked and under-appreciated as it is, without taking on the burden of *yet another* library to maintain if it isn't justified. Part of the process for justifying that library or set of algorithms is to run the gauntlet of Python-Ideas, and for good or ill, most of the people here tend to be pretty conservative in technology matters, possibly even more conservative than the core-devs. The people here on Python-Ideas are a subset of the Python community; if you can't convince us that your idea is good, you probably aren't going to get much interest from the rest of the community either. Don't take it personally, it's just the nature of the Python community. If you want a community more open to change, you might consider the Node.js or PHP communities, or so I'm led to believe. [...]
Am I misunderstanding you? You seem to be saying that your algorithm has become *slower*. In the past, your algorithm beat other unnamed algorithms for moderate sized lists of 100K elements, but now your algorithm doesn't beat them unless the lists are large, 50 million elements or so. So in relative terms, you seem to be saying that your algorithm is now 500 times slower relative to alternatives than it used to be. I don't think that's what you mean, so can you clarify?
Possibly. As we have seen from this thread, the benefits of locality can sometimes make up for a less sophisticated algorithm. If I've learned anything from this thread, it is that relying on Big Oh to predict performance is even less useful today than it used to be. [...]
You're on the wrong forum for those questions. Here, we are most concerned with whether or not the proposal is a good fit for the language and stdlib, and the person making the proposal is expected to have done their homework and determined other use-cases *first*, or at least as part of the process. If you want to maximise the number of people seeing your proposal, which will in turn increase your chances of getting feedback of potential use-cases, you might try posting it on somewhere like Reddit first. -- Steven

On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham <higginbr@gmail.com> wrote:
You are right, it is definitely not easy to benchmark, which I realized after I digested what Tim and Andrew wrote about the string representation, hashing and caching:

1) There are at least two levels of separate meta-data constructs before reaching the actual "const char*" for a string in a string list.
2) The hashes on strings are cached.
3) Sequentially created objects most likely end up in consecutive memory; when you shuffle them, everything gets messed up.

So I made a series of tests to prove some of my theories. I am using timeit (because I believe it is the best I can do), but I am trying to use it while taking into account the points mentioned above. The one consequence is that I run all timeit tests in only one iteration, to avoid any side effects from cached hashing (I will demonstrate this with an example). I chose relatively large datasets (10M string lists) where the execution time should be sufficiently long to justify the one-iteration rule (my machine is not stressed at all at the time I run the tests). I chose the set difference operation (A - B) as the benchmark, which your code implements in the `list_relcomp` function (I got the latest version of `listex_timit.py` from your repo).

First let's set up the environment:

In [2]: from os import urandom
   ...: from random import shuffle
   ...: from timeit import timeit
   ...: from listex_timit import list_relcomp

Now let's see the impact of the string hash caching:

In [3]: a = [urandom(8).hex() for i in range(10000000)]
In [4]: timeit('set(a)', number=1, globals=locals())
Out[4]: 1.5521358999999961
In [5]: timeit('set(a)', number=1, globals=locals())
Out[5]: 0.9837615

What happens here is that I create a list of random strings; the first timeit measures the set operation where all the strings have to be hashed, while the consecutive one measures only the build time for the "set construct", since the hashing has already been done. The important point is that if I run the benchmark in several iterations, from the second iteration on I will only be benchmarking the set construction, but not the hashing, and with just 10 iterations it will basically eliminate the hashing time from the benchmark results. This may, or may not, actually correspond to the real use case scenario, so I am not saying it is a priori wrong, just that one has to pay attention to what he is benchmarking and whether he is benchmarking what he thinks he is benchmarking.

The second important point is the memory access pattern and cache thrashing. Let's see what happens when I shuffle the data:

In [6]: shuffle(a)
In [7]: timeit('set(a)', number=1, globals=locals())
Out[7]: 1.4697333000000015

Now even with hashes already cached the time went up by 50% again, because of the cache thrashing. And if I start from fresh, without any cached hashes, and shuffle the data:

In [8]: a = [urandom(8).hex() for i in range(10000000)]
In [9]: shuffle(a)
In [10]: timeit('set(a)', number=1, globals=locals())
Out[10]: 2.033926100000002

Quite a difference from the "best case".
Let's see now how cache thrashing impacts the lists (there is no hash caching in the list ops, so the time is invariant to the iterations):

    In [11]: a = [urandom(8).hex() for i in range(10000000)]
    In [14]: timeit('list(a)', number=1, globals=locals())
    Out[14]: 0.13943980000001943
    In [15]: timeit('list(a)', number=1, globals=locals())
    Out[15]: 0.13443960000003585
    In [16]: shuffle(a)
    In [17]: timeit('list(a)', number=1, globals=locals())
    Out[17]: 0.25544789999997874
    In [18]: timeit('list(a)', number=1, globals=locals())
    Out[18]: 0.2587161000000151

With the combined knowledge from the previous tests, I can now run the actual benchmark for the 'A - B' set difference using the set operations and explain the results (I am converting back to a list to make it equivalent to your `list_relcomp`):

    In [19]: a = [urandom(8).hex() for i in range(10000000)]
    In [20]: b = [urandom(8).hex() for i in range(10000000)]
    In [21]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[21]: 3.8052587000000244
    In [22]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[22]: 2.6856725000000097

We can clearly see the impact of the hash caching on the time (I do not include the following timeit executions, but they were giving basically the same time from the 2nd iteration on). Your function runs like this:

    In [23]: a = [urandom(8).hex() for i in range(10000000)]
    In [24]: b = [urandom(8).hex() for i in range(10000000)]
    In [25]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[25]: 18.45155139999997
    In [26]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[26]: 7.87324250000006
    In [27]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[27]: 7.88726209999993

Let's say the first times are comparable, as there was no data prep. The second times are also comparable: the set ops have cached the hashes, and you sorted your lists in place (which gives the significant improvement in the subsequent runs). This may also skew your benchmark results significantly if you run them in iterations, though (if the real use case does not preserve the sorting).
Finally, I add the same exercise with the data shuffled, which is possibly the worst case scenario. First the set ops:

    In [28]: a = [urandom(8).hex() for i in range(10000000)]
    In [29]: b = [urandom(8).hex() for i in range(10000000)]
    In [30]: shuffle(a)
    In [31]: shuffle(b)
    In [32]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[32]: 4.671694499999944
    In [33]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[33]: 3.6129220999999916
    In [34]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[34]: 3.424502699999948
    In [35]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[35]: 3.4897670999999946

then your `list_relcomp` implementation:

    In [36]: a = [urandom(8).hex() for i in range(10000000)]
    In [37]: b = [urandom(8).hex() for i in range(10000000)]
    In [38]: shuffle(b)
    In [39]: shuffle(a)
    In [40]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[40]: 24.619852700000024
    In [41]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[41]: 7.913995600000021
    In [42]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[42]: 8.124123400000371
    In [43]: timeit('list_relcomp(a, b, True, True)', number=1, globals=locals())
    Out[43]: 8.000760899999932

I wrote all this to show that without insight it can sometimes be difficult or impossible to do it right (I was caught in several pitfalls myself in my original benchmarks posted here) and also because it was actually fun to learn quite a few things from this thread. I am however not convinced that step-compare lists can compete with set operations (even when done on lists) at the current state of things.

Richard M.

Richard Musil wrote:
When I first wrote the algorithms (2000 or so) I compared them to [x for x in a if x in b] and I found that the latter was faster. It didn't make sense at first, but it was because the 'x in b' is done in C (I had just graduated and started using Python). If the data set was large enough I could see that the big-O growth was greater for the C version than for the Python version, similar to the difference we see here between the set operations and listex. I implemented listex in C and it was much faster at all data sizes over maybe 10 elements. I don't expect the same speedup compared to sets because they are more efficient. I also can't guarantee that the trend of set operations having higher time growth will hold true for all data sizes. I only point it out and say that for some use cases set relation operations aren't optimal, though they may be good enough.
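[Editorial sketch, not from the original listex code: the two approaches being contrasted above. The comprehension's 'x in b' is a linear scan of b even though that scan runs in C, so its growth is roughly O(len(a) * len(b)); the hash-based version pays for hashing once and then does O(1) average lookups.]

    # naive: each 'x in b' rescans b, roughly len(a) * len(b) comparisons (the scan runs in C)
    def intersect_naive(a, b):
        return [x for x in a if x in b]

    # hash-based: build a set of b once, then each membership test is O(1) on average
    def intersect_hashed(a, b):
        bset = set(b)
        return [x for x in a if x in bset]

    # e.g. intersect_naive([1, 2, 3, 6], [1, 4, 5, 6, 7]) == intersect_hashed([1, 2, 3, 6], [1, 4, 5, 6, 7]) == [1, 6]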

Note that CPython's sort is in the business of merging sorted (sub)lists. Any mergesort is. But CPython's adapts to take advantage, when possible, of "lumpy" distributions. For example, if you sort list(range(1000000, 2000000)) + list(range(0, 1000000)) it goes very fast (O(N)). That's because it first rediscovers that the list is a catenation of two sorted sublists, then notices that, because the smallest element of the left sublist is greater than the largest element of the right sublist, there's no need to compare anything else _between_ the sublists. The sorted result has to be the right sublist followed by the left one.

The same would work for a union operation if the sublists were given as distinct arguments. Similarly for intersection (which would deduce almost instantly that the intersection must be empty). And so on.

That's not by accident - the inspiration for CPython's sort's basic "galloping" approach was taken from this paper, which wasn't about sorting at all: "Adaptive Set Intersections, Unions, and Differences" (2000), Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro. That paper is concerned with manipulating large sets represented as sorted sequences. They noticed too that data was "lumpy" in real life, and that paper was the start of a number of papers trying ways to take advantage of that.

So if someone _really_ wants to go gonzo with this, that's a promising approach. Do note, though, that it doesn't fit naturally with iterators. "Galloping" relies on O(1) access to data at arbitrary index positions; it's a fancier variant of binary search (see listsort.txt in your CPython source distribution for details). If cache misses can be so expensive, how can that win? Because the number of index accesses per galloping step is logarithmic in the number of sequence elements consumed, and "compare" steps can (very roughly speaking) be skipped entirely for all the sequence elements _not_ touched by the galloping indices. Pointers to those are merely copied, without even looking at the data they point to. Which often turns out to be a huge win in CPython, because comparison in CPython is expensive.
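[A quick way to see the "lumpy data" effect described above; this timing sketch is mine, not from Tim's post, and the numbers will vary by machine. Sorting a catenation of two already-sorted runs finishes in close to linear time, while sorting a shuffled copy of the same values pays the full N log N:]

    from random import shuffle
    from timeit import timeit

    lumpy = list(range(1000000, 2000000)) + list(range(0, 1000000))  # two sorted runs
    messy = lumpy[:]
    shuffle(messy)                                                   # same values, no structure

    print(timeit(lambda: sorted(lumpy), number=1))  # galloping merge of two runs: near O(N)
    print(timeit(lambda: sorted(messy), number=1))  # general case: O(N log N)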

On Sep 22, 2019, at 15:28, Tim Peters <tim.peters@gmail.com> wrote:
I wonder if you could get any benefit by using something like a skiplist-rope hybrid, where you have effectively O(log n) random access rather than O(1), and can only copy huge chunks with O(log n) large memcopies instead of 1 huge one. Would those costs be enough to sink the benefits of galloping? It wouldn’t be usable in all cases, but there are many cases where your huge input is naturally iterable in this way (a set of files to concatenate, a huge file that you can seek in or mmap chunks of, maybe even a semi-lazy rope that you got from some Rust or Haskell library via ctypes). If you know the size of each iterator, you can build the first level of an approximate skiplist, and build the rest up on the fly to approximate binary search well enough (I think) for the purpose.

Anyway, if I really did want to do these operations on massive datasets, I’d look at pandas+HDF5 to see what it can give me before considering implementing anything from scratch. But it could be a fun project…

A few further half-formed thoughts on this:

For a single huge file treated this way, you’d probably want to look at ripgrep. The author started off with the conventional wisdom on when to read vs. mmap vs. windowed-mmap and quickly discovered that all of the conventional wisdom was so badly out of date as to be useless…

Could you get any benefit out of parallelizing the gallop seekaheads? Probably not for 1-10 million, but we are talking about huge datasets here.

If you have an iterator extended with a nextn method that can iterate a large chunk of data at a time (which a file can do if you’re treating it as a sequence of bytes or characters, but not if you’re treating it as a sequence of lines, at least without a second pass…), you could build the same thing—but the cost is that you might easily end up using temporary N/2 memory, which defeats most of the benefits of using arbitrary iterables.

C++, Swift, etc. don’t have a notion of a “bisectable iterator”, where, like a bidirectional iterator, it still takes O(m) time to jump forward or backward m steps, but it only takes O(1) time to jump (approximately?) halfway between two known positions. You could easily build that on top of anything random-access, and also on top of anything logarithmic like a tree or skiplist, and it seems like there are useful things you can do with it. Or maybe almost all such things are worth specializing differently for random-access vs. inherently-logarithmic data structures, or maybe it’s just too hard to come up with a proper useful abstraction?

Tim Peters wrote:
That's funny, as that was about the same time I was working on my list compare. I actually used a binary search as well. I divided the 'merged' list into three sections: a head, a body, and a tail. The head is where the values in one list are totally outside the other list on the 'low' (in sort order) side. The body is where the two lists start to overlap, and I used the minimum of the two lists and a binary search to find that. The tail is the opposite end, so there was no need to use a binary search to find where it starts (it's where one list ends). I was very surprised I could sort the list and step through it with any speed improvement. No matter how I changed the initial list state it was very fast. I expected it to be faster on the upper end with large lists, but not on the lower. It was my first time using sort and my second Python program, and I lucked out. Good times!
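[A minimal sketch of the head/body/tail idea described above, assuming both lists are already sorted and using bisect for the binary searches; the function and variable names are mine, not from the original module.]

    from bisect import bisect_left

    def intersect_head_body_tail(a, b):
        # a and b are assumed sorted; make a the list we slice the head/tail from
        if a and b and a[0] > b[0]:
            a, b = b, a
        if not a or not b:
            return []
        # head: elements of a below b[0] cannot appear in b, so skip them
        start = bisect_left(a, b[0])
        # tail: elements of a above b[-1] cannot appear in b, so stop before them
        end = bisect_left(a, b[-1], lo=start)
        if end < len(a) and a[end] == b[-1]:
            end += 1
        # body: plain one-pass step-compare over the overlapping region
        result = []
        j = 0
        for x in a[start:end]:
            while j < len(b) and b[j] < x:
                j += 1
            if j < len(b) and b[j] == x:
                result.append(x)
        return result

    # intersect_head_body_tail([1, 2, 3, 6], [1, 4, 5, 6, 7]) -> [1, 6]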

On Mon, Sep 23, 2019 at 12:21 AM Richard Higginbotham <higginbr@gmail.com> wrote:
I believe that your "suspicion" is correct and the set variant is executed mostly in C code (though I do not have any intimate knowledge of the Python implementation, so I may be wrong), while your code is executed as bytecode, and I also believe no one here claimed otherwise. This however is not a problem with `timeit`, whose main feature is that it is designed for benchmarking and is available in the standard library, so anyone can use it and the results are (somewhat) comparable. If you design your own benchmarking tool (I do not think datetime is a good candidate though), anything you gain you lose in reproducibility and comparability. Besides, saying that a benchmark does not give the results you expect would probably also require explaining what you expect and why.
So far the benchmarks have shown that the list implementation was slower than the sets in every non-trivial case, sometimes by quite a margin. I am not saying that a C implementation cannot be faster; in fact, I will probably be eager to test it if you come up with one. But after what I read (and posted) in this thread I am also reluctant to accept that a C implementation _must_ be faster (or must be significantly faster). Maybe your code, even when run in Python, is already well optimized and won't get a significant speedup, or maybe it will be 5x faster; honestly, I do not dare even to guess. The only way to prove it is to actually implement it in C. Considering your use case, however, I wonder whether you would not be better off going with the iterator approach (as Andrew has been hinting at for some time in his posts). Richard M.

Richard Musil wrote:
The problem I have with timeit is actually just side effects. The same issue comes into play with sets if you are prebuilding them. If I use timeit to run my algorithm 10 times, the first execution takes the most time and sorts the lists; later executions are faster, but I don't want those in the results. I'm looking for average-to-worst-case results. In the case of comparing C code with bytecode, we've already given set comparisons a huge boost by choosing simple hashable items and a C implementation. Trying to optimize it even further by using the least number of statements possible starts to just be silly. I only said my algorithm crossed a threshold of speed at some number of elements. That should be expected of every algorithm. It's not an indictment of the set library or Python. Lots of perfectly good algorithms don't scale well; mine obviously doesn't scale down very well.

On Sep 23, 2019, at 15:32, Richard Higginbotham <higginbr@gmail.com> wrote:
The code you posted doesn’t do any galloping; it just advances the index one at a time. The fact that you could do approximate binary searches to optimize it doesn’t help if you never do those searches. So it’ll be exactly as poor on unbalanced inputs as my iterator code. After all, they’re exactly the same algorithm (except for trivial differences in how they handle one input or the other ending early); the only difference is that one implementation calls next and remembers the value, while the other increments an index and indexes a list.

Of course if you always have lists already in memory, the added overhead of using an iterator over the list is cost for no benefit. But that’s a fixed multiplier; it’s not different for balanced vs. unbalanced inputs. And of course if the iterator ever allows you to avoid building an otherwise-unnecessary list, it’s a win, but that win is also a flat cost that depends only on the allocation time for N elements, which also doesn’t vary based on balanced vs. unbalanced inputs.

Anyway, if your use case really is a gigantic directory on a filesystem that guarantees sorted order and supports iterdir, iterators should help quite a bit.
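[To make the comparison concrete, a hedged sketch (my code, not from either repository) of the iterator flavour of the same one-pass step-compare, computing "in A and not in B" over two already-sorted iterables. It uses None as an end sentinel, so it assumes the data itself never contains None.]

    def sorted_iter_difference(a, b):
        # yield items of sorted iterable a that do not appear in sorted iterable b
        bit = iter(b)
        bval = next(bit, None)
        for x in a:
            while bval is not None and bval < x:
                bval = next(bit, None)
            if bval is None or bval != x:
                yield x

    # list(sorted_iter_difference([1, 2, 3, 6], [1, 4, 5, 6, 7])) -> [2, 3]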

Andrew Barnert wrote:
In the spirit of comradery, I'll try to follow Steven D'Aprano's community advice and be a little more blunt. I'll skip the insults though; I wrote them but I took them out. Honestly, I could have done this at any time, I just didn't want to hurt people's feelings. I already converted it to C a long time ago and did many, many tests I don't want to repeat. I know how it scales. We don't even have to convert to C. Let's just use pypy. For grins we'll change to an integer hash.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.130
best relcomp_set with set init intersect test time (msec): 0.115
best list_relcomp test time (msec): 0.012
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.117
best relcomp_set with set init test time (msec): 2.448
best relcomp_set with set init intersect test time (msec): 2.649
best list_relcomp test time (msec): 0.077
best sort test time (msec): 0.048

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Huh? At about 1000 elements they are equal, but after that my 2nd Python program is 10x faster, oops. Let's try more iterations "with a real Benchmark program". I won't repeat mine though; we'll just take the first number that comes by.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.116
best relcomp_set with set init intersect test time (msec): 0.118
best list_relcomp test time (msec): 0.015
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.115
best relcomp_set with set init test time (msec): 2.729
best relcomp_set with set init intersect test time (msec): 2.780
best list_relcomp test time (msec): 0.111
best sort test time (msec): 0.049

You don't want to see data sets higher than that. Ok, a little sarcasm, but you earned it. All you had to do was not be jerks.

On Sep 21, 2019, at 04:35, Richard Musil <risa2000x@gmail.com> wrote:
I wrote all this to show that without insight it can sometimes be difficult or impossible to do it right (I was caught in several pitfalls myself in my original benchmarks posted here) and also because it was actually fun to learn quite a few things from this thread. I am however not convinced that step-compare lists can compete with set operations (even when done on lists) at the current state of things.
One case I think you didn’t test is when the strings are generated in already-sorted order. In that case, as opposed to the case where you generate in random order and then sort, I think the PyUnicode objects and the actual character arrays will both be mostly contiguous (at least in a fresh process)—although I think the only way you could verify that is by using ctypes (or a C extension) to peek into the internal PyUnicode structs and look at the appropriate pointers. Anyway, the access pattern should be more than simple enough for the CPU to effectively cache-prefetch at all three levels, which could be a significant speedup for the sorted-list algorithms over the set algorithms. And this isn’t an entirely artificial scenario; sometimes large numbers of strings really are generated in sorted order and in large batches like this—e.g., the filesystem, or a network client, or whatever is handing you data in sorted order and you put it right into a list of strings. So, it might be worth testing.

But I think the real win for already-sorted data is laziness. With sets, at least one of your data sets has to be strict, so you can put it in a set. With sorted lists, with a minor rewrite to the algorithms (which I posted earlier for intersection), you can use them directly on iterators instead. Imagine you’re reading 7GB worth of (already-sorted) strings and comparing to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the set, sorted list, or even anything clever that pandas might do, they’ll all throw your machine into swap hell, which will probably swamp the cost of using step-compare instead of sets and of using Iterators instead of lists. This one might be harder to benchmark (because the cost of the Iterator is going to end up threaded through the cost of the algorithm), but it seems like it should be a huge win over sets.
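[To illustrate the laziness point, a sketch under the same assumptions: sorted inputs, my own code and made-up file names, not the intersection version Andrew posted earlier. The intersection is computed while streaming both inputs, so neither input ever has to be materialized as a list or a set.]

    def sorted_iter_intersection(a, b):
        # one-pass step-compare over two sorted iterables, yielding common items lazily
        ait, bit = iter(a), iter(b)
        try:
            x, y = next(ait), next(bit)
            while True:
                if x < y:
                    x = next(ait)
                elif y < x:
                    y = next(bit)
                else:
                    yield x
                    x, y = next(ait), next(bit)
        except StopIteration:
            return  # either input exhausted: nothing more can be common

    # hypothetical usage: both files are assumed to hold one sorted string per line
    with open('a_names.txt') as fa, open('b_names.txt') as fb:
        for name in sorted_iter_intersection((line.rstrip('\n') for line in fa),
                                             (line.rstrip('\n') for line in fb)):
            print(name)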

On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarnert@yahoo.com> wrote:
I did not benchmark it, because the original code from the OP has explicit sorting built in, so I simply concluded this was not the case. Nevertheless...
For the sake of completeness I did some benchmarks also with already-sorted strings. I chose the following strings:

    In [3]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
    In [4]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]
    In [5]: a[:5]
    Out[5]: ['String#00000000', 'String#00000002', 'String#00000004', 'String#00000006', 'String#00000008']
    In [6]: b[:5]
    Out[6]: ['String#00000001', 'String#00000003', 'String#00000005', 'String#00000007', 'String#00000009']

First with the set ops:

    In [7]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[7]: 3.9231181000000106
    In [8]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[8]: 2.642592999999991
    In [9]: timeit('list(set(a).difference(b))', number=1, globals=locals())
    Out[9]: 2.648031700000004

This does not really differ from the previous results. Then with `list_relcomp` (note the 2x `False` in the argument list to disable the sorting):

    In [10]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
    In [11]: b = [f'String#{i:08}' for i in range(1, 20000001, 2)]
    In [12]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
    Out[12]: 4.851956800000011
    In [13]: timeit('list_relcomp(a, b, False, False)', number=1, globals=locals())
    Out[13]: 4.8776169999999865

One interesting thing I noticed when running `list_relcomp` with a trivial input (where a = c, but in different objects):

    In [16]: a = [f'String#{i:08}' for i in range(0, 20000000, 2)]
    In [17]: c = [f'String#{i:08}' for i in range(0, 20000000, 2)]
    In [18]: timeit('list_relcomp(a, c, False, False)', number=1, globals=locals())
    Out[18]: 2.8997035999999525

It ends up running much faster, because it does not have to build the resulting list (the set implementation's performance does not change much in this case), while in the first setup everything in A goes into the result. So there are many details, but unless the use case can guarantee them, I am not sure they are worth the speculation.

But I think the real win for already-sorted data is laziness. With sets, at
If you move everything to the iterators and maybe even dilute the operations over some async I/O it may be pretty effective, but when talking about the algorithm alone, I am not sure even the pre-sorted lists have an advantage, at least not in this implementation, maybe when implemented in C? Richard M.

On Sep 22, 2019, at 12:50, Richard Musil <risa2000x@gmail.com> wrote:
On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert <abarnert@yahoo.com> wrote:
One case I think you didn’t test is when the strings are generated in already-sorted order.
For the sake of completeness I did some benchmarks also with already sorted strings.
I’m a bit surprised that it didn’t make a huge difference—but only a bit; it really is hard to predict performance when you get down to this level of nitpicking allocation with this many layers of abstraction in the way.
But I think the real win for already-sorted data is laziness. With sets, at least one of your data sets has to be strict, so you can put it in a set. With sorted lists, with a minor rewrite to the algorithms (which I posted earlier for intersection), you can use them directly on iterators instead. Imagine you’re reading 7GB worth of (already-sorted) strings and comparing to 2GB worth of strings on a machine with 8GB of RAM and plenty of swap space. Clearly the set, sorted list, or even anything clever that pandas might do, they’ll all throw your machine into swap hell, which will probably swamp the cost of using step-compare instead of sets and of using Iterators instead of lists. This one might be harder to benchmark (because the cost of the Iterator is going to end up threaded through the cost of the algorithm), but it seems like it should be a huge win over sets.
If you move everything to the iterators and maybe even dilute the operations over some async I/O it may be pretty effective, but when talking about the algorithm alone, I am not sure even the pre-sorted lists have an advantage, at least not in this implementation, maybe when implemented in C?
You don’t really have to do anything to move the operations to the iterators. That’s the whole beauty of generator expressions, itertools, etc.: pile on as many iterator transforms as you like, including things like union or difference, and the resulting iterator automatically interleaves the work lazily on demand. That works even if the inputs are coming in from some kind of external I/O. (Making it work with push-model asyncio or similar frameworks does require rewriting things slightly, however.)

I slapped together a complete (but barely tested, so you might want to verify correctness) implementation at https://github.com/abarnert/iterset if you want to benchmark it. It’s not at all optimized (in fact, there will usually be two extra generators in the path to each value, plus the overhead of checking for and not calling a nonexistent key function, etc.), so I’m sure it’ll be slowed down by a considerable constant factor for the in-memory case. But if you run it against two generators of strings, compared to the cost of building a set or two lists with the whole thing in memory, I wouldn’t be surprised if it overtakes them before you even get to swap hell or MemoryError. (Assuming you consume the result iteratively, of course.) And it will definitely beat them when you do get to that size.

Of course if you really do need to intersect two massive sorted inputs that aren’t already in memory, I’ll bet there’s some kind of pandas/HDF5 way to do it that buffers up large chunks and optimizes the hell out of everything and blows my code away.

Let me go back to the top here. On Sep 18, 2019, at 12:29, Richard Higginbotham <higginbr@gmail.com> wrote:
I have frequently come across cases where I would like to compare items in one list in another similar to relational algebra.
I’ve put together an itertools-style implementation at https://github.com/abarnert/iterset if anyone’s interested. It includes merge, union, intersection, difference, symmetric_difference, issubset, and issuperset, all taking two sorted iterables and a key function.

While all of these things (except merge) can be done by just putting one iterable into a set (or Counter, if you need dups) and passing the other to a method, there are plenty of reasons to prefer these functions in some cases: preserving input order, taking key functions, working on non-hashable values, chaining up itertools-style with other iterator transforms, etc. Notice that the C++ standard library has exactly the same set of functions at the same level of flexibility (they take C++ iterator ranges and an optional less-than function), despite having methods on (both unordered hash and ordered tree) sets, for much the same reasons. Also notice that itertools already has at least one function that expects to work on sorted iterables, groupby, and that in general itertools is intended to provide an “iterator algebra” that this fits into nicely.

So, I think they’re useful enough to exist somewhere, possibly in itertools, possibly in a third-party library (maybe as part of more-itertools and/or toolz rather than on their own). It’s barely tested, not at all optimized, not fully documented, and generally not ready to turn into a patch to the stdlib, more-itertools, or toolz, or into a PyPI package. But if anyone wants it polished up to that level, let me know and I could probably find time this week. (If not, I’ll just keep it in my toolbox and maybe come back to it next time I need it.)

I haven’t tested performance at all. I expect it to be slower (when used with already-sorted data that’s already in a list) than Richard Higginbotham’s list implementation by a potentially hefty multiplier. From a quick&dirty test, I think the multiplier can be brought way down (albeit at significant cost to readability), but it’ll still be higher than 1. And that’s already slower than sets for at least most uses, as Richard Musil has demonstrated. Of course for huge data where reading everything into memory means swap hell, it’ll be a whole lot faster, but if you really need that, I’ll bet pandas/HDF5 has even better alternatives. Interleaving the work with the cost of reading or constructing the values might help performance in some cases, but that’s hard to generalize about. Overall, you probably want these functions when you need to preserve input order, use a key function, etc., not when sets aren’t fast enough.

Hey Everybody, I do want to thank you for this souvenir. I'm having it blown up and mounted on the wall. I'm going to leave some space for the benchmarks that compare my module and his with the standard set implementation.

"I haven’t tested performance at all. I expect it to be slower (when used with already-sorted data that’s already in a list) than Richard Higginbotham’s list implementation by a potentially hefty multiplier. From a quick&dirty test, I think the multiplier can be brought way down (albeit at significant cost to readability), but it’ll still be higher than 1." -- Andrew Barnert, Python project.

See, Andrew, that's how you do sarcasm. None of that is a lie, I'm really gonna do it.

What if, instead, there was an `OrderedBag` class that acts like a `list` and supports set operations? The advantage of this is that any hidden indexing that is used to optimize the operations could be efficiently created in the object that results from the set operation during its construction while performing the operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).
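[Purely as a proof-of-concept sketch of what such a class might look like; the names and semantics here are my assumptions, not a worked-out design. It keeps list order and duplicates but builds its hash index once per instance, so the index of a result object can be reused instead of being rebuilt for every operation. It requires hashable items.]

    from collections import Counter

    class OrderedBag:
        """List-like ordered multiset with set-style operations (sketch only)."""

        def __init__(self, items=()):
            self._items = list(items)
            self._counts = Counter(self._items)   # hidden index, built once per instance

        def __iter__(self):
            return iter(self._items)

        def __len__(self):
            return len(self._items)

        def __contains__(self, item):
            return item in self._counts

        def difference(self, other):
            # keep order and duplicates from self, drop anything present in other
            index = other._counts if isinstance(other, OrderedBag) else set(other)
            return OrderedBag(x for x in self._items if x not in index)

        def intersection(self, other):
            index = other._counts if isinstance(other, OrderedBag) else set(other)
            return OrderedBag(x for x in self._items if x in index)

    # hypothetical usage:
    #   a = OrderedBag([1, 2, 2, 3, 6]); b = OrderedBag([1, 4, 5, 6, 7])
    #   list(a.difference(b))   -> [2, 2, 3]
    #   list(a.intersection(b)) -> [1, 6]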

On Oct 13, 2019, at 19:15, Steve Jorgensen <stevej@stevej.name> wrote:
What if, instead, there was an `OrderedBag` class that acts like a `list` and supports set operations?
The advantage of this is that any hidden indexing that is used to optimize the operations could be efficiently created in the object that results from the set operation during its construction while performing the operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).
This sounds doable, but nontrivial to design and implement, and with a lot of tradeoffs to consider. I’m not sure there is a one-size-fits-all implementation that would be good enough for most uses, but if there is, you probably need to find it and at least give us a proof of concept. For example, when I’ve needed to do the kind of thing you seem to be talking about, it’s often because I have a (multi)set of values with different orderings on it. Like a database table with multiple indices. If you don’t have any ordering except for a partial ordering on hash/equality, you probably wouldn’t want the same thing I do.

Steve Jorgensen wrote:
Let's look at the test results from earlier for the 5 million and 10 million item lists.

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Changing the order of the set operations based on which set is largest gives almost a 10x speed improvement. Should the set operations be changed? Past about 10 million elements it's probably faster, no matter the order, to convert them into lists and use my algorithm. In the end, though, we are still talking about a few seconds at most. If that improvement is worth it to you, that's one thing, but you should consider what you really need. Without clear performance/use-case criteria I think you'll never leave the bike-shed here.
participants (10)
- Andrew Barnert
- Chris Angelico
- Dominik Vilsmeier
- Kyle Lahnakoski
- Rhodri James
- Richard Higginbotham
- Richard Musil
- Steve Jorgensen
- Steven D'Aprano
- Tim Peters