Kyle Lahnakoski wrote:
On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:

Andrew Barnert wrote: set(b).intersection(a) or set(a) & set(b) or sb = set(b) then [x for x in a if x in sb] and you're done. They can easily understand why it works. If they want to know why it's faster, you can easily explain it, and they've learned something widely useful.

This isn't technically correct. It's not faster. It all depends on the use case, which, when it contradicts your expectations, you simply deride as "artificial micro benchmarks". Python isn't just used as a toy scripting language. 50 million elements in a collection is not even large by a long shot, at least from where I sit. You can make a case that Python is not a good language for that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like people have been doing for decades". That you ask whether that is a proper response to your users is very concerning to me.
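(For concreteness, the spellings being discussed amount to roughly the following; a and b here are placeholder lists filled with dummy data, purely for illustration.)

    a = list(range(0, 1_000_000, 2))
    b = list(range(0, 1_000_000, 3))

    common = set(a) & set(b)            # unordered intersection as a set
    common = set(b).intersection(a)     # the same result, built from b

    sb = set(b)                         # one O(len(b)) conversion up front...
    common_in_order = [x for x in a if x in sb]   # ...then membership tests keep a's order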
Richard, I can identify with the need to intersect large sets, and I am often faced with poor choices when it comes to dealing with them. I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow. I find it disappointing to convert an elegant expression declaring what I want, like "a - b", into code that declares how to do it, like "set(b).intersection(a)". This disappointment is magnified by the fact that someone, somewhere has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search for an elegant solution** on the internet.

Maybe we are looking for a different type of solution. You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database. A database may be able to deliver the speeds you require: it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is. I am suggesting that a Python container implementation, like sortedcontainers but backed by a database (maybe Sqlite), may be a solution. Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations and made relatively insignificant. There is a delay in translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting.

Could you link to the repo where these tests are run? I seem to have lost the link in the long chain of emails on this list. I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all.

Thank you.

** The sortedcontainers library mentioned above is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.
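If it helps make the suggestion concrete, the kind of thing I have in mind might look roughly like the sketch below. The function, table, and column names are made up for illustration; a real container would hide this behind __sub__().

    import sqlite3

    # Rough sketch only: load both collections into SQLite tables and let the
    # database compute the difference with EXCEPT (rows of the first SELECT
    # that do not appear in the second, i.e. a - b).
    def sqlite_difference(a, b):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE a (v PRIMARY KEY)")
        db.execute("CREATE TABLE b (v PRIMARY KEY)")
        db.executemany("INSERT OR IGNORE INTO a VALUES (?)", ((v,) for v in a))
        db.executemany("INSERT OR IGNORE INTO b VALUES (?)", ((v,) for v in b))
        rows = db.execute("SELECT v FROM a EXCEPT SELECT v FROM b")
        return [v for (v,) in rows]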
Hi Kyle,

The repo is here: https://github.com/ponderpanda/listex

I'm still looking into sortedcontainers and Dominik's suggestion of numpy. Both seem promising. At the moment I'm more concerned with in-memory operations. Disk access complicates things, but I think it would be interesting to include it. We've seen how memory access times can turn superior algorithms into inferior ones at scale; it's very likely the same is true once disk access times are involved. I'm not sure anyone would want to wait for that benchmark to finish, though. My suspicion is that a hybrid approach would be optimal. It gets more complicated when you consider that you could throw in more computers to increase the available memory.

If you are going to use my routines for speed testing with large data sets, I would try simple timers instead of the benchmark suite. It takes much longer to run the tests repeatedly, and at that scale the timing precision is drowned out by the long run times.

I've been considering parallelization/distribution (sharing CPU and memory) and I think I have a pretty good idea of how to do it; I just haven't needed it for my work yet. I think the set style (hash tables) would be less likely to benefit from parallelization, but I haven't looked into it at all yet. Depending on the size of your datasets, you might benefit more from that than from a relational database.
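To illustrate what I mean by simple timers, something along these lines is enough at this scale. The data sizes are placeholders, and you would swap in whichever routine from the repo you want to compare.

    import time

    def time_once(label, fn, *args):
        # One wall-clock measurement; with runs this long, repetition adds
        # little precision and a lot of waiting.
        start = time.perf_counter()
        result = fn(*args)
        print(f"{label}: {time.perf_counter() - start:.3f} s")
        return result

    a = list(range(10_000_000))
    b = list(range(0, 10_000_000, 2))

    def set_difference(a, b):
        sb = set(b)
        return [x for x in a if x not in sb]

    time_once("set-based difference", set_difference, a, b)
    # time_once("listex difference", <routine from the repo>, a, b)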