Andrew Barnert wrote:
> On Sep 23, 2019, at 15:32, Richard Higginbotham <higginbr@gmail.com> wrote:
>> Richard M. wrote:
>>> Considering your use case however, I wonder if you would not be better off going with the iterator approach (as Andrew has been hinting already for some time in his posts).
>> They will most likely have good performance on data sets that are close, but very poor on those that aren't balanced.
> The code you posted doesn't do any galloping, it just advances the index one at a time. The fact that you could do approximate binary searches to optimize it doesn't help if you never do those searches. So it'll be exactly as poor on unbalanced inputs as my iterator code.

Yes, it probably will, but we won't be able to add binary searches using iterators. In which case, why use iterators... It's a pretty basic optimization step.
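To make the point concrete, here is a minimal sketch of the two variants under discussion (illustrative names, not the code I posted; assumes both inputs are sorted lists of comparables):

    from bisect import bisect_left

    def relcomp_step(a, b):
        # A - B over sorted lists, advancing the index one element at a time.
        out, j = [], 0
        for x in a:
            while j < len(b) and b[j] < x:
                j += 1  # one comparison per element of b, no skipping
            if j == len(b) or b[j] != x:
                out.append(x)
        return out

    def relcomp_gallop(a, b):
        # The same merge walk, but it jumps ahead with a binary search
        # instead of stepping -- the optimization a list supports and a
        # plain iterator cannot.
        out, j = [], 0
        for x in a:
            j = bisect_left(b, x, j)  # skip a whole run of b at once
            if j == len(b) or b[j] != x:
                out.append(x)
        return out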
> After all, they're exactly the same algorithm (except for trivial differences in how they handle one input or the other ending early); the only difference is that one implementation calls next and remembers the value, the other increases an index and indexes a list.

I'm not bragging about my code. I think it's neat, but I wrote it 20 years ago, after graduating from college with a chemical engineering degree. It was my second Python program, and maybe my third or fourth program ever outside of re-implementing the C library (as is common with students).
> Of course, if you always have lists already in memory, the added overhead of using an iterator over the list is cost for no benefit. But that's a fixed multiplier; it's not different for balanced vs. unbalanced inputs. And of course, if the iterator ever allows you to avoid building an otherwise-unnecessary list, it's a win, but that win is also a flat cost that depends only on the allocation time for N elements, which also doesn't vary based on balanced vs. unbalanced inputs. Anyway, if your use case really is a gigantic directory on a filesystem that guarantees sorted order and supports iterdir, iterators should help quite a bit.
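For comparison, the iterator form of the same walk would look something like this (again a sketch, assuming sorted, finite inputs; note there is no list here for a binary search to index into):

    def iter_relcomp(a, b):
        # Yield elements of sorted iterable a that are not in sorted
        # iterable b: call next() and remember the value, rather than
        # increasing an index into a list.
        sentinel = object()
        it = iter(b)
        cur = next(it, sentinel)
        for x in a:
            while cur is not sentinel and cur < x:
                cur = next(it, sentinel)
            if cur is sentinel or cur != x:
                yield x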
In the spirit of camaraderie, I'll try to follow Steven D'Aprano's community advice and be a little more blunt. I'll skip the insults, though; I wrote them, but I took them out. Honestly, I could have done this at any time, I just didn't want to hurt people's feelings. I already converted it to C a long time ago and ran many, many tests I don't want to repeat; I know how it scales. We don't even have to convert to C. Let's just use PyPy. For grins, we'll change to an integer hash.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.130
best relcomp_set with set init intersect test time (msec): 0.115
best list_relcomp test time (msec): 0.012
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.117
best relcomp_set with set init test time (msec): 2.448
best relcomp_set with set init intersect test time (msec): 2.649
best list_relcomp test time (msec): 0.077
best sort test time (msec): 0.048

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Huh? At about 1000 elements they are equal, but after that my second Python program is 10x faster. Oops. Let's try more iterations "with a real benchmark program". I won't repeat mine, though; we'll just take the first number that comes by.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.116
best relcomp_set with set init intersect test time (msec): 0.118
best list_relcomp test time (msec): 0.015
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.115
best relcomp_set with set init test time (msec): 2.729
best relcomp_set with set init intersect test time (msec): 2.780
best list_relcomp test time (msec): 0.111
best sort test time (msec): 0.049

You don't want to see data sets higher than that. OK, a little sarcasm, but you earned it. All you had to do was not be jerks.
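P.S. For anyone who wants to poke at numbers like these themselves: my script is not reproduced here, but a minimal harness along these lines (function bodies, sizes, and labels are illustrative, not my original code) prints output in the same shape:

    import random
    import timeit
    from bisect import bisect_left

    def list_relcomp(a, b):
        # Merge-style A - B over sorted lists, jumping ahead with bisect.
        out, j = [], 0
        for x in a:
            j = bisect_left(b, x, j)
            if j == len(b) or b[j] != x:
                out.append(x)
        return out

    def best_msec(thunk, repeat=5):
        # "best ... test time": the minimum over several one-shot runs.
        return min(timeit.repeat(thunk, number=1, repeat=repeat)) * 1000

    # Integer keys, mirroring the "change to an integer hash" run above.
    a = sorted(random.sample(range(10_000_000), 1000))
    b = sorted(random.sample(range(10_000_000), 5000))
    sa, sb = set(a), set(b)

    t_set = best_msec(lambda: sa - sb)            # sets already built
    t_set_init = best_msec(lambda: set(a) - set(b))  # pay for set construction
    t_list = best_msec(lambda: list_relcomp(a, b))
    t_sort = best_msec(lambda: sorted(a))

    print(f"For A not in B: trying A {len(a)}, B {len(b)}")
    print(f"best relcomp_set test time (msec): {t_set:.3f}")
    print(f"best relcomp_set with set init test time (msec): {t_set_init:.3f}")
    print(f"best list_relcomp test time (msec): {t_list:.3f}")
    print(f"best sort test time (msec): {t_sort:.3f}")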