On Sep 20, 2019, at 09:21, Richard Higginbotham wrote:
Richard Musil wrote:
On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert <abarnert@yahoo.com> wrote:
On Sep 19, 2019, at 15:18, Richard Musil <risa2000x@gmail.com> wrote:
After some thought I don't think the time comparisons are useful unless we include the set init time. If we already have two sets, would anyone convert them to lists just to compare them? In the use case where we read a list of files, wouldn't we want to put them into a list instead of a set? I would think we would only put them into a set if we knew we would be doing set operations.
And? If I want to get a-b, a&b, and b-a, as I’d need to for a directory diff, I’m not going to do it by keeping them in lists and converting to sets three times. Just like I’m not going to keep them unsorted and sort them three times to do step-compare operations. So the cost of the operations themselves, separate from the setup time, clearly does matter.
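To make that concrete, here is a minimal sketch of such a directory diff with sets (the file names are hypothetical); the two sets are built once and then reused for all three operations:

```python
# Hypothetical file-name listings from two directories.
a = {"app.py", "util.py", "config.toml"}
b = {"app.py", "util.py", "readme.md"}

# The one-time cost of building each set is paid once; the
# per-operation cost is then paid three times, so the cost of the
# operations themselves matters independently of the setup cost.
only_in_a = a - b   # files only in the first directory
only_in_b = b - a   # files only in the second directory
in_both = a & b     # files common to both
```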
At that point, to compare set operations with list operations fairly, we would need to compare the cost of adding the elements individually to a list versus to a set.
Why would you be adding them individually? Surely, at least if you care this much about micro-optimization, you're going to add them in bulk. If you're reading a file full of file names, you'd just do list(f) or set(f). If you're reading 10 files full of file names, or getting them out of os.walk steps, you're going to call extend or union with each batch, not iterate one by one and add them.

Of course the actual details depend on the actual use case, and I'm not sure why you're trying to come up with a microbenchmark that exactly fits some imaginary and ill-defined use in the first place. Do you then need to go simulating a filesystem's readdir performance characteristics in some particular but arbitrary scenario? What's the goal here?

I think it's pretty easy to establish that sets are fast but they're not magic, sorting is pretty good but it's not magic, and there are some cases where step-compare is faster if you have already-sorted lists. And none of this matters anyway, because in real life there are very few use cases where you can equally choose between sets and step-compare and need to micro-optimize, but lots of cases where you have _other_ reasons to choose between them. Sets work on non-comparable values; step-compare works on non-hashable values. Step-compare (if implemented right) works on lazy iterables; sets can preserve the order of the "master" input. They do different things with duplicates. Even when performance is the only issue (and it's not dominated by the file system anyway, as it would be in your case), the characteristics of your specific data are going to make a huge difference. Not to mention that there are often other compelling reasons to have a set or a sorted list, at which point the choice is made for you.

So, this already eliminates the possibility of reimplementing the set operations with step-compare, as suggested earlier.
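For illustration, a step-compare difference and intersection over already-sorted iterables might be sketched like this (the names `sorted_diff` and `sorted_intersect` are my own, not an existing API). Note that it only needs comparability, not hashability, and consumes both inputs lazily:

```python
from typing import Iterable, Iterator

def sorted_diff(a: Iterable, b: Iterable) -> Iterator:
    """Yield items of sorted iterable a that are absent from sorted iterable b."""
    it_b = iter(b)
    sentinel = object()
    y = next(it_b, sentinel)
    for x in a:
        # Advance b past everything smaller than x.
        while y is not sentinel and y < x:
            y = next(it_b, sentinel)
        if y is sentinel or x < y:      # no match for x in b
            yield x

def sorted_intersect(a: Iterable, b: Iterable) -> Iterator:
    """Yield items of sorted iterable a that also appear in sorted iterable b."""
    it_b = iter(b)
    sentinel = object()
    y = next(it_b, sentinel)
    for x in a:
        while y is not sentinel and y < x:
            y = next(it_b, sentinel)
        if y is not sentinel and not (x < y):   # x == y under the ordering
            yield x
```

Unlike the set versions, this never materializes either input and works on non-hashable values, but it requires both inputs to be pre-sorted and, unlike a set, it passes duplicates from the first input through as-is.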
But it also already makes a compelling case for adding step-compare functions to Python, though that case needs to be made clearly. I think they'd fit very nicely in itertools, because itertools already provides an algebra on iterables, including functions like groupby that expect to be handed pre-sorted iterables, but maybe not everyone agrees. I don't know whether their usefulness (and stability) is so obvious that they don't need to go on PyPI to prove themselves. I also don't know whether they already exist on PyPI (maybe as part of more-itertools and/or toolz?).

Also, I'm pretty sure there are questions on StackOverflow and elsewhere from people trying to do step-compare and getting it wrong, which could be another argument for having it in the stdlib, or at least in the recipes in the docs, but only if someone digs up those questions. And there can be bikeshedding discussions about exactly which operations are needed, and what to call them. But further artificial microbenchmarks aren't going to help resolve any of those questions.