On Sep 18, 2019, at 12:29, Richard Higginbotham
I'm not sure if there is any interest by others, but I have frequently come across cases where I would like to compare the items in one list with those in another, similar to relational algebra.
For anyone who doesn’t want to read this whole reply: I think a step-compare function is useful often enough, and confuses novices often enough (if StackOverflow is any guide), that it might belong in the stdlib. But I think it should be in itertools, and should work on arbitrary iterables that are assumed to be pre-sorted (an assumption groupby already makes, and explains in the docs). And if it’s not in more-itertools and/or toolz, maybe it’s worth submitting it to them first and seeing how much people use it.

Now on to why: to start with, itertools is already about providing an iterator algebra that isn’t identical to relational algebra, but is a lot closer than what the list builtin provides.
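To make the suggestion concrete, here is a minimal sketch of what such an itertools-style step-compare helper might look like. The name and exact behavior are my own assumptions, not an existing API: it takes two arbitrary iterables that are assumed pre-sorted (the same assumption groupby documents) and lazily yields their common elements.

```python
def iter_intersection(a, b):
    # Hypothetical itertools-style helper: both inputs are assumed
    # pre-sorted, and may be arbitrary (even one-shot) iterables.
    a, b = iter(a), iter(b)
    sentinel = object()
    x = next(a, sentinel)
    y = next(b, sentinel)
    while x is not sentinel and y is not sentinel:
        if x < y:
            x = next(a, sentinel)      # advance the smaller side
        elif y < x:
            y = next(b, sentinel)
        else:
            yield x                    # match: emit and advance both
            x = next(a, sentinel)
            y = next(b, sentinel)

print(list(iter_intersection([1, 2, 3, 4, 5, 6], [1, 3, 4, 7])))  # [1, 3, 4]
```

Because it only ever holds one element from each input, it works on lazy iterators that would never fit in memory as lists.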
For example: are the file names in A also in B, and if so, return a new list with those items. Long story short, I wrote some functions to do that. They are quite simple and fast (due to timsort in no small part).
Even the plain python code is faster than the built in set functions (afaik).
You’re testing pretty small values in a very narrow use case, and testing them inaccurately by using datetime instead of timeit (the fact that some of your results are 0 vs. 0 should be a clue…), and you haven’t even given us the results, just claimed that they’re faster.

When I run my own test on lists of 10000 strings made up of 5 random lowercase characters, I get 485us for set(a).intersection(b), and 6609us for list_intersection(a, b), so it’s actually more than 10x slower. What if the values are already sorted, or already in sets? Then I get 175us for a&b, 3250us for list_intersection(a, b, False, False), so again it’s more than 10x slower. Maybe it’s because my test set has a lot more duplicates. What if I do 50-character strings, and verify that there are no dups? Now it’s 525us vs. 6699us, or 225us vs. 2910us for pre-prepared values—a bit closer, but still more than 10x slower. And meanwhile, what if I simulate long pathnames with a common prefix? Now I get 525us vs. 10200us.
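For reference, a timeit-based harness along these lines might look like the following. The sizes, alphabet, and repeat count are my own assumptions standing in for the measurements above, not the exact benchmark used:

```python
import random
import string
import timeit

random.seed(0)

def rand_words(n, k=5):
    # n random strings of k lowercase characters (test data shape is assumed).
    return [''.join(random.choices(string.ascii_lowercase, k=k)) for _ in range(n)]

a = rand_words(10000)
b = rand_words(10000)
sa, sb = set(a), set(b)

# timeit repeats the statement and reports a total, so a fast operation
# never shows up as a meaningless 0 the way datetime subtraction can.
t_build = timeit.timeit(lambda: set(a).intersection(b), number=100)
t_pre = timeit.timeit(lambda: sa & sb, number=100)
print(f"set(a).intersection(b): {t_build / 100 * 1e6:.0f}us per call")
print(f"sa & sb (pre-built):    {t_pre / 100 * 1e6:.0f}us per call")
```

The second timing isolates the intersection itself from the cost of building the sets, which is the "already in sets" case above.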
I created a GitHub repo and put the ones I thought the community might like in there: https://github.com/ponderpanda/listex
An example would be:

    a = [1,2,3,4,5,6]
    b = [1,3,7,4]
    list_intersection(a, b, sorta=True, sortb=True)
returns [1,3,4]
The complexity ends up being about the longer of list a or b.
How can that be true? Sorting a takes nlogn steps, sorting b takes mlogm steps, step-comparing then takes n+m steps, so that’s O(nlogn + mlogm + n + m) = O(nlogn) (taking n to be the larger), which is bigger than O(n). By comparison, making a set of b takes m steps, iterating over a and looking up each value in the hash takes n steps, so it’s O(n+m) = O(n), which is smaller than O(nlogn).

It’s possible that the constant multiplier for hash lookups is so much higher than the multiplier for comparing and nexting that you’d need to get into the zillions before the logn factor outweighs the multiplier, in which case you could argue that it’s a good optimization despite having worse complexity. But you can’t argue that it has better complexity. Also, if that lower constant really is critical to performance, surely it’s not applicable to all values, as the long-pathname case demonstrates. Plus, given that your code is about 10x as slow even _without_ the sort, it doesn’t seem to be true that the multiplier is better in the first place.

Also, forgetting about performance, sort-and-step doesn’t work on values that aren’t comparable, while set does. On the other hand, set doesn’t work on values that aren’t hashable, while sort-and-step does. Strings and integers are both, but lots of types are only one or the other (list is comparable but not hashable; complex is hashable but not comparable). For something general-purpose enough that you want to add it to builtins, that seems at least worth mentioning.
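The comparable-vs-hashable trade-off is easy to demonstrate with those two types:

```python
# list: orderable but not hashable, so sort-and-step works and sets raise.
lists = [[3, 4], [1, 2], [3, 4]]
print(sorted(lists))          # fine: [[1, 2], [3, 4], [3, 4]]
try:
    set(lists)
except TypeError:
    print("lists are unhashable, so set() fails")

# complex: hashable but not orderable, so sets work and sorting raises.
cs = [3 + 4j, 1 + 2j]
print(set(cs) & {1 + 2j})     # fine: {(1+2j)}
try:
    sorted(cs)
except TypeError:
    print("complex numbers are unorderable, so sorted() fails")
```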
If they are over some thousands, the speed difference with set() operations becomes significant. There are some details about naming, unique values (not necessary), and sort order that would probably need to be ironed out if they were to be included with the built in list type.
Why would you want this as a method of list rather than a function that works just as well on any sortable iterables? The reason we have a sort method (as well as, not instead of, a sorted function) is to allow sorting in-place, which isn’t relevant here. We don’t have methods for zip, filter, etc.

Also, in most cases where I’ve wanted to intersect two lists of file names, I want to preserve the order of the first “master” list. You can’t do that with sort-and-step except by doing something pretty complicated (e.g., decorating with enumerate, using a custom compare that ignores the indexes in your intersection, then re-sorting the result by index and undecorating), but it’s trivial with sets:

    filterset = set(b)
    c = [val for val in a if val in filterset]

… or …

    c = filter(set(b).__contains__, a)

And this is still O(n+m) time, just like set(a)&set(b) would be, and with a faster constant because we’re only setifying one of the two iterables (and often the smaller one). And this has the advantage that the potentially huge “master” input can be a lazy iterator. If a is a file listing the first names of each American, and b is a much smaller set of the legal names for French children, loading that whole file into memory to sort it would be a bad idea, and it’s not necessary if you use sets.

Also, the last time I wanted to intersect two lists of file names was to diff two directories, so I needed a-b and b-a, not just a&b. Is there really a compelling use case for intersection that isn’t also a compelling use case for at least some of the other set operations? In fact, you’re arguing for just intersection, but then provided an implementation for some of the others. Which do you want, and why that particular set?
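To make the order-preserving point concrete, here is a small self-contained sketch (the file names are invented for illustration), including the directory-diff case of both one-sided differences:

```python
a = ["zeta.txt", "alpha.txt", "mid.txt", "beta.txt"]   # "master" list order
b = ["beta.txt", "zeta.txt", "other.txt"]

# Order-preserving intersection: setify only the filter list.
filterset = set(b)
c = [val for val in a if val in filterset]
print(c)          # ['zeta.txt', 'beta.txt'], a's order, not sorted order

# The directory-diff case needs both one-sided differences.
only_in_a = [val for val in a if val not in filterset]
only_in_b = [val for val in b if val not in set(a)]
print(only_in_a)  # ['alpha.txt', 'mid.txt']
print(only_in_b)  # ['other.txt']
```

A sort-and-step intersection would instead hand back ['beta.txt', 'zeta.txt'] in sorted order, losing the master list's ordering.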
Also, if you care about duplicates (and if you don’t care about either order or duplicates, why aren’t you just using a set in the first place?), you need to specify whether this is multiset intersection (so if X appears 5 times in A and 2 times in B it shows up 2 times in A&B), or set intersection (so it shows up just once). The most obvious way of doing sort-and-step is going to do neither of those (it’ll be 5, because it appears 5 times in A and each of those has a matching value in B); you could implement either pretty easily, but only if you decide which you want and then do it. With sets, you explicitly make the choice of set or Counter, and then whether you get set or multiset behavior is obvious; here, it’s not obvious, and if you want it to be a choice you have to add another flag argument.
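The three behaviors can be spelled out with collections.Counter (the 5-vs-2 element here is the hypothetical X from above):

```python
from collections import Counter

a = ["x"] * 5 + ["y"]
b = ["x"] * 2 + ["z"]

# Set intersection: duplicates collapse, x appears once.
print(sorted(set(a) & set(b)))                        # ['x']

# Multiset intersection: Counter keeps the minimum count, x appears twice.
print(sorted((Counter(a) & Counter(b)).elements()))   # ['x', 'x']

# The "obvious" behavior (keep every element of a with a match in b): five x's.
print([v for v in a if v in set(b)])                  # ['x', 'x', 'x', 'x', 'x']
```

With sets and Counters the choice is explicit in the types; a sort-and-step function would need an extra flag argument to offer the same choice.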