On Thu, Sep 19, 2019 at 10:18 PM Richard Higginbotham wrote:
I'm sorry, I haven't used the mailing list before and didn't send my comments to the list. Please see my response to Chris for some example times. I'm reluctant to use anything less than about a millisecond for timing purposes; there are too many interactions at the microsecond level that can skew results. I'd prefer to expand the sample size until the results are clearer. Here is an example:
For A not in B: trying A 10000, B 50000
  naive test    0:00:10.319032
  set test      0:00:00.009001
  listex test   0:00:00.005000

For A subset of B: trying A 100000, B 500000
  set test      0:00:00.180018
  listex test   0:00:00.054006
Richard, I took your listex.py, modified it to use the timeit module for benchmarking, and put it here (https://gist.github.com/risa2000/c44c1a06e89f348d82efcce5ec722774). The results I got from the modified code were:

For A not in B: trying A 1000, B 5000
  best naive_relcomp test time (msec): 38.125
  best relcomp_set test time (msec): 0.034
  best relcomp_set with set init test time (msec): 0.218
  best list_relcomp test time (msec): 0.211

For A not in B: trying A 10000, B 50000
  best relcomp_set test time (msec): 0.545
  best relcomp_set with set init test time (msec): 3.093
  best list_relcomp test time (msec): 2.101

For A subset of B: trying A 100000, B 500000
  best relcomp_set test time (msec): 8.508
  best relcomp_set with set init test time (msec): 55.001
  best list_relcomp test time (msec): 29.242

Now, what is interesting about this is the benchmark I added: calculating the set difference while including the set setup (basically the set(list) call). This is the line with "best relcomp_set with set init". For those not familiar with Richard's listex.py:

1) 'relcomp_set' calculates the set difference by using the subtraction operator (a - b) on two sets built from the two unsorted lists.
2) 'relcomp_set with set init' is the same function but with the set setup included, so it basically measures set(list_a) - set(list_b).
3) 'list_relcomp' is the function using Richard's implementation of the difference on the lists (again unsorted).

I did not verify the results (I hope they are correct), but the timings suggest that the set operation itself is always faster. When, however, we add the set setup to the mix, the build-up time, which is O(n + m), becomes more significant than actually sorting and comparing the lists, which is technically O(n log(n) + m log(m)).
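For anyone who wants to reproduce the shape of these three measurements without reading the gist, here is a minimal sketch using timeit. The list_relcomp below is my stand-in for a merge-style list difference (sort both lists, then walk them in tandem); it is an assumption about the approach, not a copy of listex.py:

```python
import random
import timeit

def list_relcomp(a, b):
    # Stand-in for a sorted-merge list difference (assumed,
    # not copied from listex.py): sort both lists, then walk
    # them in tandem, keeping elements of a absent from b.
    a, b = sorted(a), sorted(b)
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j >= len(b) or b[j] != x:
            out.append(x)
    return out

a = random.sample(range(1_000_000), 10_000)
b = random.sample(range(1_000_000), 50_000)
sa, sb = set(a), set(b)

# best-of-5 timings, reported in milliseconds
def best_ms(stmt):
    return min(timeit.repeat(stmt, number=10, repeat=5)) / 10 * 1000

print("relcomp_set               :", best_ms(lambda: sa - sb))
print("relcomp_set with set init :", best_ms(lambda: set(a) - set(b)))
print("list_relcomp              :", best_ms(lambda: list_relcomp(a, b)))
```

The essential point is that only the second measurement charges the set(list) construction to the operation being timed; the first assumes the sets already exist.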
It would also suggest that for some large lists the list implementation _can_ actually be faster if we factor in an eventual conversion of the lists to sets, and possibly even faster if the lists are already ordered. What I cannot explain is why the setup time becomes so significant only at A 10000, B 50000 and does not make much of a difference in the A 1000, B 5000 case. It would suggest that there is some other factor whose impact depends on the set size.

Richard M.
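One way to probe that unexplained jump would be to time just the set construction at several sizes: if the per-element cost were constant, set(list) would scale linearly, so a super-linear growth (for instance once the hash table outgrows the CPU caches) would show up directly. A hedged sketch, assuming random integer data:

```python
import random
import timeit

# Time bare set(list) construction at increasing sizes and
# report the per-element cost; a flat per-element figure would
# mean linear scaling, a rising one would point to some other
# size-dependent factor (e.g. cache behaviour).
for n in (5_000, 50_000, 500_000):
    data = random.sample(range(10 * n), n)
    t = min(timeit.repeat(lambda: set(data), number=5, repeat=5)) / 5
    print(f"n={n:>7}: set(list) {t * 1e3:8.3f} ms, "
          f"per element {t / n * 1e9:6.1f} ns")
```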