On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert wrote:
> On Sep 19, 2019, at 15:18, Richard Musil wrote:
>> Ok, I misread the original code, the lists were not sorted in the
>> previous results (and they should be). So with the correction to have them sorted,
>
> I think to be fair, you want to show it _both_ ways, just as you’re showing sets both with and without creating the sets. Because sometimes you’re already going to have sorted lists, just as sometimes you’re already going to have sets.
Right. It was just giving wrong results, and I was not sure the algorithm on a "particularly unsorted" list would be representative (because it would produce the wrong results), so I removed it. I added a new test, `list_relcomp[presorted]`, which runs on pre-sorted lists (so the results are correct) to see whether "making it correct" has any impact. So there are four things to compare:
* set operation on existing sets
* set operation including the time to build the sets
* step-compare operation on pre-sorted lists
* step-compare including the time to sort the lists
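The step-compare implementation itself isn't shown in this message; a minimal sketch of how such a merge-style `list_relcomp` might look (a hypothetical reconstruction, not the actual benchmarked code) is:

```python
def list_relcomp(a, b):
    """Relative complement a - b via step-compare on two sorted lists.

    Assumes both inputs are already sorted; returns a sorted list of
    the elements of `a` that do not appear in `b`.
    """
    result = []
    i = j = 0
    while i < len(a):
        # Advance j past every element of b smaller than a[i].
        while j < len(b) and b[j] < a[i]:
            j += 1
        # Keep a[i] only if it has no match in b.
        if j >= len(b) or b[j] != a[i]:
            result.append(a[i])
        i += 1
    return result
```

Each index only moves forward, so the whole pass is O(len(a) + len(b)); sorting first costs O(n log n), which is why the presorted variant is timed separately.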
> (Also, for the set operation, there’s no need to build _two_ sets. Just build a set of one and intersect it with the other list.)
Right (again). When I was considering the algorithm on the lists, I thought that technically I could build just one set on the fly, but the other operand still has to be hashed to test membership. I assumed the hash operation would dominate, and basically told myself that the cost of building the second set would not be significant. I just added another implementation to the test (which I believe is the one you had in mind):

```
def relcomp_set_list(a, b):
    bset = set(b)
    return [ia for ia in a if ia not in bset]
```

and confirmed that avoiding the second set build actually brings a noticeable improvement :).

I also changed the naming: the original code used `relcomp_set` for the set operation 'a - b' but returned the result as a list, which seemed a bit confusing to me. So I renamed it to `relcomp_set_to_list` (not to be confused with `relcomp_set_list` mentioned above) and added `relcomp_set`, which does a pure set 'a - b' and returns a set. That way one can see the overhead of constructing the resulting list compared to the original `relcomp_set`. The results are:

For A not in B: trying A 1000, B 5000
naive_relcomp (msec):             38.105
relcomp_set (msec):                0.025
relcomp_set_to_list (msec):        0.034
relcomp_set with set init (msec):  0.215
relcomp_set_list (msec):           0.192
list_relcomp (msec):               0.329
list_relcomp[presorted] (msec):    0.214

For A not in B: trying A 10000, B 50000
relcomp_set (msec):                0.416
relcomp_set_to_list (msec):        0.545
relcomp_set with set init (msec):  3.167
relcomp_set_list (msec):           2.546
list_relcomp (msec):               3.372
list_relcomp[presorted] (msec):    2.005

For A subset of B: trying A 100000, B 500000
relcomp_set (msec):                 8.557
relcomp_set_to_list (msec):         8.519
relcomp_set with set init (msec):  55.783
relcomp_set_list (msec):           48.903
list_relcomp (msec):              143.577
list_relcomp[presorted] (msec):   120.911

There is still one peculiarity in the last test: relcomp_set and relcomp_set_to_list gave basically the same result.
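For reference, the two renamed helpers described above presumably reduce to something like this (a hedged sketch; the actual benchmark code may differ):

```python
def relcomp_set(a, b):
    # Pure set difference on two existing sets; returns a set.
    return a - b

def relcomp_set_to_list(a, b):
    # The same difference, plus the cost of building the result list.
    return list(a - b)
```

When A is a subset of B, the difference is (nearly) empty, so the extra `list(...)` construction in `relcomp_set_to_list` costs almost nothing, which would explain the near-identical timings in the last test.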
I interpret it as indicating that the result is small, so its construction is insignificant compared to the other operations in play.

Richard M.