On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert wrote:
> On Sep 19, 2019, at 15:18, Richard Musil wrote:
>> Ok, I misread the original code, the lists were not sorted in the
>> previous results (and they should be). So with the correction to have them sorted,
>
> I think to be fair, you want to show it _both_ ways, just as you’re showing sets both with and without creating the sets. Because sometimes you’re already going to have sorted lists, just as sometimes you’re already going to have sets.
Right. It was just giving wrong results, and I was not sure the algorithm on a "particularly unsorted" list would be representative (because it would produce the wrong results), so I removed it. I added a new test, `list_relcomp[presorted]`, which runs on pre-sorted lists (so the results are correct) to see whether "making it correct" has any impact. So there are four things to compare:
* set operation on existing sets
* set operation including the time to build the sets
* step-compare operation on pre-sorted lists
* step-compare including the time to sort the lists
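The step-compare implementation itself isn't shown in this message; a minimal sketch of how such a merge-style `list_relcomp` might look (a hypothetical reconstruction, not the actual benchmarked code) is:

```python
def list_relcomp(a, b):
    """Relative complement a - b via step-compare on two sorted lists.

    Assumes both inputs are already sorted; returns a sorted list of
    the elements of `a` that do not appear in `b`.
    """
    result = []
    i = j = 0
    while i < len(a):
        # Advance j past every element of b smaller than a[i].
        while j < len(b) and b[j] < a[i]:
            j += 1
        # Keep a[i] only if it has no match in b.
        if j >= len(b) or b[j] != a[i]:
            result.append(a[i])
        i += 1
    return result
```

Each index only moves forward, so the whole pass is O(len(a) + len(b)); sorting first costs O(n log n), which is why the presorted variant is timed separately.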
> (Also, for the set operation, there’s no need to build _two_ sets. Just build a set of one and intersect it with the other list.)
Right (again). When I was considering the algorithm on the lists, I thought that technically I could build just one set on the fly, but the other operand still has to be hashed to test membership. I assumed the hash operation would dominate, and basically told myself that the cost of building the second set would not be significant. I just added another implementation to the test (which I believe is the one you had in mind):

```
def relcomp_set_list(a, b):
    bset = set(b)
    return [ia for ia in a if ia not in bset]
```

and confirmed that avoiding the second set build actually brings a noticeable improvement :).

I also changed the naming: the original code used `relcomp_set` for the set operation 'a - b' but returned the result as a list, which seemed a bit confusing to me. So I renamed it to `relcomp_set_to_list` (not to be confused with `relcomp_set_list` mentioned above) and added `relcomp_set`, which does a pure set 'a - b' and returns a set. That way one can see the overhead of constructing the resulting list compared to the original `relcomp_set`. The results are:

For A not in B: trying A 1000, B 5000
naive_relcomp (msec):             38.105
relcomp_set (msec):                0.025
relcomp_set_to_list (msec):        0.034
relcomp_set with set init (msec):  0.215
relcomp_set_list (msec):           0.192
list_relcomp (msec):               0.329
list_relcomp[presorted] (msec):    0.214

For A not in B: trying A 10000, B 50000
relcomp_set (msec):                0.416
relcomp_set_to_list (msec):        0.545
relcomp_set with set init (msec):  3.167
relcomp_set_list (msec):           2.546
list_relcomp (msec):               3.372
list_relcomp[presorted] (msec):    2.005

For A subset of B: trying A 100000, B 500000
relcomp_set (msec):                 8.557
relcomp_set_to_list (msec):         8.519
relcomp_set with set init (msec):  55.783
relcomp_set_list (msec):           48.903
list_relcomp (msec):              143.577
list_relcomp[presorted] (msec):   120.911

There is still one peculiarity in the last test: relcomp_set and relcomp_set_to_list gave basically the same result.
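For reference, the two renamed helpers described above presumably reduce to something like this (a hedged sketch; the actual benchmark code may differ):

```python
def relcomp_set(a, b):
    # Pure set difference on two existing sets; returns a set.
    return a - b

def relcomp_set_to_list(a, b):
    # The same difference, plus the cost of building the result list.
    return list(a - b)
```

When A is a subset of B, the difference is (nearly) empty, so the extra `list(...)` construction in `relcomp_set_to_list` costs almost nothing, which would explain the near-identical timings in the last test.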
I interpret it as indicating that the result is small, so its construction is insignificant compared to the other operations in play.

Richard M.