[Python-ideas] Re: Set operations with Lists

21 Sep 2019


      Andrew Barnert wrote:
...
On Sep 20, 2019, at 14:51, Richard Higginbotham higginbr@gmail.com wrote:
...
Andrew Barnert wrote:
What’s the goal here?
I'm not concerned about micro op, I'm concerned with macro op on large data sets.
Right now if someone comes to you with that simple use case of two large lists you have to
tell them to convert to sets and then back again. Forget the new programmers who tried the
naive way and quite because "python is too slow", they are waiting 20 seconds  when we
could deliver it to them in 1 second.
But today, you can deliver it to them in 1 second: just give them
set(b).intersection(a) or set(a) & set(b) or sb =
set(b) then [x for x in a if x in sb] and you’re done. They can easily
understand why it works. If they want to know why it’s faster, you can easily explain it,
and they’ve learned something widely useful.
This isn't technically correct. It's not faster. It all depends on the use case which when it contradicts your expectations you just deride as "artifical micro benchmarks". Python isn't just used as a toy scripting language. 50 million elements in a collection is not even large by a long shot at least from where I sit. You can make a case that it's not a good language for that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like people have been doing for decades". That you ask if that is a proper response to your users is very concerning to me.
...
If we add a new function, whether it’s in builtins or itertools or elsewhere, then you
can give them that function and you’re done; that’s no easier or harder. They probably
won’t immediately know how it works, but if they ask, they’ll learn something useful. And
once they get why it works, they probably will get why it’s faster. So, overall, the new
solution is roughly the same benefits as the existing one that’s been there since Python
2.3.
...
I think it’s
pretty easy to establish that sets are fast but they’re not magic, sorting
is pretty good but it’s not magic, there are some cases where step-compare is faster if
you have already-sorted lists… and none of this matters anyway, because in real life
there
are very few use cases where you can equally choose between sets and step compare and
need
to micro-optimize, but lots of cases where you have _other_ reasons to choose between
them.
There is no step compare option in python now though. You either convert to set or you
hope the sun doesn't expire before your program finishes.
Or you write the step-compare manually, the way C programmers have done for
decades. Or you find a third-party library or copy and paste code off a Stack Overflow
answer.
Converting to set is easier, and usually the right answer.
The important question is, for those cases where it _isn’t_ the right answer, is “write
it yourself or find it on SO” good enough?
So now all the scientist using Python also need to learn C, or at least be quick to copy, paste, or know the "secret" knowledge of using sets except for their problems it may be much slower.. so they'll also need to know all about pythons implementations of sets, cache misses, and those artificial micro benchmarks that give people a sad so we can't talk about. I'm actually OK with that as an answer as condescending and unprofessional as it may be.
...
...
Sets work on
non-comparable values; step-compare works on non-hashable values.
Step-compare (if implemented right) works on lazy iterables; sets can preserve the order
of the “master”input. They do different things with duplicates. Even when performance is
the only issue (and it’s not dominated by the file system anyway, as it would be in your
case), the characteristics of your specific data are going to make a huge difference. Not
to mention that there are often other compelling reasons to have a set or a sorted list,
at which point the choice is made for you.
So, this already eliminates the possibility of reimplementing set operations with
step-compare as suggested earlier. But it also already makes a compelling case for adding
step-compare functions to Python—but that case needs to be made clearly.
I'm trying to make that case.
No you’re not. You’re trying to make the case that step-compare is better
than set in all uses. Which isn’t true. And it isn’t necessary in the first place.
If you ever find yourself arguing that you know what someone else is thinking better than they do, you should stop right there because you are probably wrong.
...
The case that needs to be made is that it’s good for some cases that we can’t handle
today, and that at least one of those cases matters, and that the current state of affairs
where people have to go figure out how to do it is not good enough.
So someone can drop by in an attempt to help you so long as they come on hand and knee and prove to you that you need their help. I don't think that can ever work. When you start off by saying their use case doesn't matter it's clear you've already made up your mind. It's obvious you've taken this personally and would go through any hurdle to thwart a consideration of my proposal. That's OK, like I said it wasn't for me, it was for other people.
...
...
I've never come across another step and compare.
People do this all over the place. For example look at the merge step of any
merge sort: it’s a step-compare union, possibly in-place, possibly n-ary instead of
binary, but recognizably the same thing.
I thought we were talking about Python and algebraic set operations and I said after my previous work I never needed to..
...
In C++, the stdlib has all of these algorithms, abstracted to work on any forward
Iterators, in two versions that handle duplicates differently, plus in-place versions of
some of them.
I did exactly what you just said earlier I wrote a C extension for this. I suspect I could do it again, but as it would be an investment of time and effort so I thought I would try to get a consensus view. This process and in particular this discussion has convinced me that even if I want to go out of my way to help this is not the community to be doing it in. There are some good people here who are also very technical. You should talk to them because there is such a huge difference between say the response of Tim Peters and yourself that it's like night and day. That's all I have to say.