Steve Jorgensen wrote:
> What if, instead, there was an OrderedBag class that acts like a list and supports set operations? The advantage of this is that any hidden indexing used to optimize the operations could be created efficiently, during construction, in the object that results from the set operation. Using lists, any such indexes would have to be produced and discarded for each operation (or could perhaps be kept in weak references or something like that, but that's messy).

It seems like it would depend on the memory layout. If the items weren't stored in order, it would most likely suffer from cache misses the same way set operations do. I'm not sure whether it would be slower than set operations in that case. Most likely there would be a curve in the run time where sets are faster up to some small number of items and slower for large collections, similar to the list implementation. Adding binary searches might make it faster in the majority of cases, but that's a lot of variables (a rough sketch of the idea follows below).

The big question then becomes: what criteria does it need to meet, and is that quantified? That quantification doesn't seem to be a consideration on this list. Unless you have a particular itch to scratch, that is a lot of work, and adding enough general testing scenarios to satisfy a consensus is probably much, much more.
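A minimal sketch of the OrderedBag idea, assuming hashable items (the class and method names here are mine, purely for illustration): a list preserves insertion order, a hidden set acts as the membership index, and the result of a difference builds its own index while the operation runs instead of producing and discarding one per operation.

    class OrderedBag:
        def __init__(self, items=()):
            self._items = list(items)       # preserves order, allows duplicates
            self._index = set(self._items)  # hidden index for O(1) membership tests

        def __iter__(self):
            return iter(self._items)

        def __len__(self):
            return len(self._items)

        def __contains__(self, item):
            return item in self._index

        def difference(self, other):
            # The result's hidden index is built during construction,
            # while the operation itself is being performed.
            other_index = other._index if isinstance(other, OrderedBag) else set(other)
            result = OrderedBag()
            for item in self._items:
                if item not in other_index:
                    result._items.append(item)
                    result._index.add(item)
            return result

    bag = OrderedBag([3, 1, 4, 1, 5])
    print(list(bag.difference([5])))  # [3, 1, 4, 1]: order and duplicates kept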
Let's look at the test results from earlier for the 5 million and 10 million item lists.

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Changing the order of the set operations based on which set is larger gives almost a 10x speed improvement (see the intersection sketch below). Should set operations be changed? Past about 10 million items it's probably faster, no matter the order, to convert the sets into lists and use my algorithm (also sketched below). In the end, though, we are still talking about a few seconds at most. If that improvement is worth it to you, that's one thing, but you should consider what you really need. Without clear performance/use-case criteria, I think you'll never leave the bike-shed here.
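To make the ordering point concrete: for a commutative operation like intersection, iterating over the smaller set and probing the larger one bounds the loop by the smaller size. A sketch of that kind of size-based reordering (the function is mine, for illustration only):

    def intersect(a, b):
        # Iterate the smaller set and probe the larger one, so the loop
        # runs min(len(a), len(b)) times rather than max(len(a), len(b)).
        if len(a) > len(b):
            a, b = b, a
        return {x for x in a if x in b}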
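The list_relcomp implementation isn't reproduced here, but a merge-style relative complement over two sorted lists, which is presumably what it amounts to, looks roughly like the sketch below. It runs in O(len(a) + len(b)) with purely sequential memory access, which is why it sidesteps the cache misses that hurt set operations on large collections.

    def list_relcomp(a, b):
        """Items of sorted list a that do not appear in sorted list b."""
        result = []
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                result.append(a[i])  # a[i] is below every remaining b item
                i += 1
            elif a[i] > b[j]:
                j += 1               # b[j] can't match anything left in a
            else:
                i += 1               # a[i] appears in b, so drop it
        result.extend(a[i:])         # the tail of a has no possible match in b
        return result

    print(list_relcomp([1, 2, 4, 7, 9], [2, 3, 7]))  # [1, 4, 9]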