Andrew Barnert wrote:
> On Sep 23, 2019, at 15:32, Richard Higginbotham <higginbr@gmail.com> wrote:
>> Richard M. wrote:
>>> Considering your use case however, I wonder if you would not be better off going with the iterator approach (as Andrew has been hinting already for some time in his posts).
>> They will most likely have good performance on data sets that are close, but very poor on those that aren't balanced.
> The code you posted doesn't do any galloping, it just advances the index one at a time. The fact that you could do approximate binary searches to optimize it doesn't help if you never do those searches. So it'll be exactly as poor on unbalanced inputs as my iterator code.

Yes, it probably will, but we won't be able to add binary searches using iterators. In which case, why use iterators... It's a pretty basic optimization step.
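To make the point concrete, here is a minimal sketch of the two variants under discussion (illustrative names, not the code I posted; assumes both inputs are sorted lists of comparables):

    from bisect import bisect_left

    def relcomp_step(a, b):
        # A - B over sorted lists, advancing the index one element at a time.
        out, j = [], 0
        for x in a:
            while j < len(b) and b[j] < x:
                j += 1  # one comparison per element of b, no skipping
            if j == len(b) or b[j] != x:
                out.append(x)
        return out

    def relcomp_gallop(a, b):
        # The same merge walk, but it jumps ahead with a binary search
        # instead of stepping -- the optimization a list supports and a
        # plain iterator cannot.
        out, j = [], 0
        for x in a:
            j = bisect_left(b, x, j)  # skip a whole run of b at once
            if j == len(b) or b[j] != x:
                out.append(x)
        return out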
> After all, they're exactly the same algorithm (except for trivial differences in how they handle one input or the other ending early); the only difference is that one implementation calls next and remembers the value, the other increases an index and indexes a list.

I'm not bragging about my code. I think it's neat, but I wrote it 20 years ago, after graduating from college with a chemical engineering degree. It was my second Python program, and maybe my third or fourth program ever outside of re-implementing the C library (as is common with students).
> Of course, if you always have lists already in memory, the added overhead of using an iterator over the list is cost for no benefit. But that's a fixed multiplier; it's not different for balanced vs. unbalanced inputs. And of course, if the iterator ever allows you to avoid building an otherwise-unnecessary list, it's a win, but that win is also a flat cost that depends only on the allocation time for N elements, which also doesn't vary based on balanced vs. unbalanced inputs. Anyway, if your use case really is a gigantic directory on a filesystem that guarantees sorted order and supports iterdir, iterators should help quite a bit.
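For comparison, the iterator form of the same walk would look something like this (again a sketch, assuming sorted, finite inputs; note there is no list here for a binary search to index into):

    def iter_relcomp(a, b):
        # Yield elements of sorted iterable a that are not in sorted
        # iterable b: call next() and remember the value, rather than
        # increasing an index into a list.
        sentinel = object()
        it = iter(b)
        cur = next(it, sentinel)
        for x in a:
            while cur is not sentinel and cur < x:
                cur = next(it, sentinel)
            if cur is sentinel or cur != x:
                yield x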
In the spirit of camaraderie, I'll try to follow Steven D'Aprano's community advice and be a little more blunt. I'll skip the insults, though; I wrote them, but I took them out. Honestly, I could have done this at any time, I just didn't want to hurt people's feelings. I already converted it to C a long time ago and ran many, many tests I don't want to repeat; I know how it scales. We don't even have to convert to C. Let's just use PyPy. For grins, we'll change to an integer hash.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.130
best relcomp_set with set init intersect test time (msec): 0.115
best list_relcomp test time (msec): 0.012
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.117
best relcomp_set with set init test time (msec): 2.448
best relcomp_set with set init intersect test time (msec): 2.649
best list_relcomp test time (msec): 0.077
best sort test time (msec): 0.048

For A not in B: trying A 1000000, B 5000000
best relcomp_set test time (msec): 82.131
best relcomp_set with set init test time (msec): 880.753
best relcomp_set with set init intersect test time (msec): 886.761
best list_relcomp test time (msec): 7.902
best sort test time (msec): 5.826

For A not in B: trying A 5000000, B 10000000
best relcomp_set test time (msec): 557.368
best relcomp_set with set init test time (msec): 3228.475
best relcomp_set with set init intersect test time (msec): 3725.047
best list_relcomp test time (msec): 25.626
best sort test time (msec): 13.632

Huh? At about 1000 elements they are equal, but after that my second Python program is 10x faster. Oops. Let's try more iterations "with a real benchmark program". I won't repeat mine, though; we'll just take the first number that comes by.

For A not in B: trying A 1000, B 5000
best relcomp_set test time (msec): 0.009
best relcomp_set with set init test time (msec): 0.116
best relcomp_set with set init intersect test time (msec): 0.118
best list_relcomp test time (msec): 0.015
best sort test time (msec): 0.006

For A not in B: trying A 10000, B 50000
best relcomp_set test time (msec): 0.115
best relcomp_set with set init test time (msec): 2.729
best relcomp_set with set init intersect test time (msec): 2.780
best list_relcomp test time (msec): 0.111
best sort test time (msec): 0.049

You don't want to see data sets higher than that. OK, a little sarcasm, but you earned it. All you had to do was not be jerks.
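P.S. For anyone who wants to poke at numbers like these themselves: my script is not reproduced here, but a minimal harness along these lines (function bodies, sizes, and labels are illustrative, not my original code) prints output in the same shape:

    import random
    import timeit
    from bisect import bisect_left

    def list_relcomp(a, b):
        # Merge-style A - B over sorted lists, jumping ahead with bisect.
        out, j = [], 0
        for x in a:
            j = bisect_left(b, x, j)
            if j == len(b) or b[j] != x:
                out.append(x)
        return out

    def best_msec(thunk, repeat=5):
        # "best ... test time": the minimum over several one-shot runs.
        return min(timeit.repeat(thunk, number=1, repeat=repeat)) * 1000

    # Integer keys, mirroring the "change to an integer hash" run above.
    a = sorted(random.sample(range(10_000_000), 1000))
    b = sorted(random.sample(range(10_000_000), 5000))
    sa, sb = set(a), set(b)

    t_set = best_msec(lambda: sa - sb)            # sets already built
    t_set_init = best_msec(lambda: set(a) - set(b))  # pay for set construction
    t_list = best_msec(lambda: list_relcomp(a, b))
    t_sort = best_msec(lambda: sorted(a))

    print(f"For A not in B: trying A {len(a)}, B {len(b)}")
    print(f"best relcomp_set test time (msec): {t_set:.3f}")
    print(f"best relcomp_set with set init test time (msec): {t_set_init:.3f}")
    print(f"best list_relcomp test time (msec): {t_list:.3f}")
    print(f"best sort test time (msec): {t_sort:.3f}")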