On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
On Sep 19, 2019, at 15:18, Richard Musil <risa2000x@gmail.com> wrote:
>
> Ok, I misread the original code, the lists were not sorted in the previous results (and they should be). So with the correction to have them sorted,

I think to be fair, you want to show it _both_ ways, just as you’re showing sets both with and without creating the sets. Because sometimes you’re already going to have sorted lists, just as sometimes you’re already going to have sets.

Right. It was just giving wrong results, and I was not sure whether the algorithm on a "particularly unsorted" list would be representative (since it would produce wrong results), so I removed it. I added a new test, `list_relcomp[presorted]`, which runs on pre-sorted lists (just to have the results correct) to see whether "making it correct" has any impact.

So there are four things to compare:

    * set operation on existing sets
    * set operation including time to build the sets
    * step-compare operation on pre-sorted lists
    * step-compare including time to sort the lists
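For reference, the step-compare operation on pre-sorted lists can be sketched as follows. This is my own minimal version of the merge-style approach (the function name `list_relcomp_sorted` is hypothetical, not the benchmark's actual code):

```python
def list_relcomp_sorted(a, b):
    # Relative complement a - b for two pre-sorted lists,
    # walking both lists with a pair of indices: O(len(a) + len(b)).
    result = []
    i = j = 0
    while i < len(a):
        if j >= len(b) or a[i] < b[j]:
            # a[i] cannot appear in b anymore; keep it.
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            # b[j] is too small to match anything left in a; skip it.
            j += 1
        else:
            # Equal: a[i] is in b, so drop it from the result.
            i += 1
    return result
```

The point of the benchmark distinction above is that this loop is linear once the lists are sorted, but sorting first costs O(n log n), which is why the pre-sorted and unsorted timings are reported separately.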

(Also, for the set operation, there’s no need to build _two_ sets. Just build a set of one and intersect it with the other list.)

Right (again). When I was considering the algorithm on the lists, I thought that, technically, I could build just one set on the fly, but the other operand has to be hashed anyway to test membership. I assumed the hashing would be the dominant cost, and basically told myself that the cost of building the second set would not be significant. I just added another implementation to the test (which I believe is the one you had in mind):
```
def relcomp_set_list(a, b):
    # Build a set from only one operand; filter the other list directly.
    bset = set(b)
    return [ia for ia in a if ia not in bset]
```
and confirmed that avoiding the 2nd set build actually brings a noticeable improvement :).

I also changed the naming: the original code used `relcomp_set` to mark the set operation 'a - b' but returned the result as a list, which seemed a bit confusing to me. So I renamed it to `relcomp_set_to_list` (not to be confused with `relcomp_set_list` mentioned above) and added a new `relcomp_set`, which does the pure set operation 'a - b' and returns a set. This way one can see the overhead of constructing the result list in the original `relcomp_set`. The results are:
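To make the naming concrete, here is my reading of the two renamed functions (a sketch, assuming both operands are already sets, since the set-building cost is measured separately in the "with set init" row):

```python
def relcomp_set(a, b):
    # Pure set difference on already-built sets; returns a set.
    return a - b

def relcomp_set_to_list(a, b):
    # Same set difference, plus the cost of building the result list.
    return list(a - b)
```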

For A not in B:
trying A 1000, B 5000
naive_relcomp (msec): 38.105
relcomp_set (msec): 0.025
relcomp_set_to_list (msec): 0.034
relcomp_set with set init (msec): 0.215
relcomp_set_list (msec): 0.192
list_relcomp (msec): 0.329
list_relcomp[presorted] (msec): 0.214

For A not in B:
trying A 10000, B 50000
relcomp_set (msec): 0.416
relcomp_set_to_list (msec): 0.545
relcomp_set with set init (msec): 3.167
relcomp_set_list (msec): 2.546
list_relcomp (msec): 3.372
list_relcomp[presorted] (msec): 2.005

For A subset of B:
trying A 100000, B 500000
relcomp_set (msec): 8.557
relcomp_set_to_list (msec): 8.519
relcomp_set with set init (msec): 55.783
relcomp_set_list (msec): 48.903
list_relcomp (msec): 143.577
list_relcomp[presorted] (msec): 120.911

There is still one peculiarity in the last test: `relcomp_set` and `relcomp_set_to_list` gave basically the same result. I interpret it as the result set being so small that constructing the output list is insignificant compared to the other operations in play.
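That interpretation is easy to check in isolation. A minimal timing sketch (my own `timeit` harness, not the benchmark used above) comparing the pure difference against difference-plus-list-construction when A is a subset of B, so the result is empty:

```python
import timeit

# A is a subset of B, so the difference is empty and the list
# construction should add essentially nothing.
a = set(range(100_000))
b = set(range(500_000))

n = 100
t_set = timeit.timeit(lambda: a - b, number=n)
t_list = timeit.timeit(lambda: list(a - b), number=n)
print(f"set diff:        {t_set / n * 1000:.3f} msec")
print(f"set diff + list: {t_list / n * 1000:.3f} msec")
```

With a large result instead (disjoint A and B), the gap between the two timings should reappear.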

Richard M.