Steve Jorgensen wrote:
> What if, instead, there was an OrderedBag class that acts like a list and supports set operations? The advantage of this is that any hidden indexing used to optimize the operations could be created efficiently, during construction, in the object that results from the set operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).

It seems like it would depend on the memory layout. If the items weren't stored in order, it would most likely suffer from cache misses the same way set operations do. I'm not sure whether it would be slower than set operations in that case. Most likely there would be a curve in the run time where sets are faster up to some small number of items and slower for large collections, similar to the list implementation. Adding binary searches might make it faster in the majority of cases, but that's a lot of variables (a rough sketch of the idea follows below).

The big question then becomes: what criteria does it need to meet, and is that quantified? That quantification doesn't seem to be a consideration on this list. Unless you have a particular itch to scratch, that is a lot of work, and adding enough general testing scenarios to satisfy a consensus is probably much, much more.
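A minimal sketch of the OrderedBag idea, assuming hashable items (the class and method names here are mine, purely for illustration): a list preserves insertion order, a hidden set acts as the membership index, and the result of a difference builds its own index while the operation runs instead of producing and discarding one per operation.

    class OrderedBag:
        def __init__(self, items=()):
            self._items = list(items)       # preserves order, allows duplicates
            self._index = set(self._items)  # hidden index for O(1) membership tests

        def __iter__(self):
            return iter(self._items)

        def __len__(self):
            return len(self._items)

        def __contains__(self, item):
            return item in self._index

        def difference(self, other):
            # The result's hidden index is built during construction,
            # while the operation itself is being performed.
            other_index = other._index if isinstance(other, OrderedBag) else set(other)
            result = OrderedBag()
            for item in self._items:
                if item not in other_index:
                    result._items.append(item)
                    result._index.add(item)
            return result

    bag = OrderedBag([3, 1, 4, 1, 5])
    print(list(bag.difference([5])))  # [3, 1, 4, 1]: order and duplicates kept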
Let's look at the test results from earlier for the 5 million and 10 million item lists.

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Changing the order of the set operations based on which set is larger gives almost a 10x speed improvement (see the intersection sketch below). Should set operations be changed? Past about 10 million items it's probably faster, no matter the order, to convert the sets into lists and use my algorithm (also sketched below). In the end, though, we are still talking about a few seconds at most. If that improvement is worth it to you, that's one thing, but you should consider what you really need. Without clear performance/use-case criteria, I think you'll never leave the bike-shed here.
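To make the ordering point concrete: for a commutative operation like intersection, iterating over the smaller set and probing the larger one bounds the loop by the smaller size. A sketch of that kind of size-based reordering (the function is mine, for illustration only):

    def intersect(a, b):
        # Iterate the smaller set and probe the larger one, so the loop
        # runs min(len(a), len(b)) times rather than max(len(a), len(b)).
        if len(a) > len(b):
            a, b = b, a
        return {x for x in a if x in b}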
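The list_relcomp implementation isn't reproduced here, but a merge-style relative complement over two sorted lists, which is presumably what it amounts to, looks roughly like the sketch below. It runs in O(len(a) + len(b)) with purely sequential memory access, which is why it sidesteps the cache misses that hurt set operations on large collections.

    def list_relcomp(a, b):
        """Items of sorted list a that do not appear in sorted list b."""
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                result.append(a[i])  # a[i] is below every remaining b item
                i += 1
            elif a[i] > b[j]:
                j += 1               # b[j] can't match anything left in a
            else:
                i += 1               # a[i] appears in b, so drop it
        result.extend(a[i:])         # the tail of a has no possible match in b
        return result

    print(list_relcomp([1, 2, 4, 7, 9], [2, 3, 7]))  # [1, 4, 9]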