Kyle Lahnakoski wrote:
On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:

Andrew Barnert wrote: set(b).intersection(a) or set(a) & set(b) or sb = set(b) then [x for x in a if x in sb] and you're done. They can easily understand why it works. If they want to know why it's faster, you can easily explain it, and they've learned something widely useful.

This isn't technically correct. It's not faster. It all depends on the use case, which, when it contradicts your expectations, you simply deride as "artificial micro benchmarks". Python isn't just used as a toy scripting language. 50 million elements in a collection is not even large by a long shot, at least from where I sit. You can make a case that Python is not a good language for that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like people have been doing for decades". That you ask whether that is a proper response to your users is very concerning to me.
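(For concreteness, the spellings being discussed amount to roughly the following; a and b here are placeholder lists filled with dummy data, purely for illustration.)

    a = list(range(0, 1_000_000, 2))
    b = list(range(0, 1_000_000, 3))

    common = set(a) & set(b)            # unordered intersection as a set
    common = set(b).intersection(a)     # the same result, built from b

    sb = set(b)                         # one O(len(b)) conversion up front...
    common_in_order = [x for x in a if x in sb]   # ...then membership tests keep a's order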
Richard, I can identify with the need to intersect large sets, and I am often faced with poor choices when it comes to dealing with them. I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow. I find it disappointing to convert an elegant expression declaring what I want, like "a - b", into code that declares how to do it, like "set(b).intersection(a)". This disappointment is magnified by the fact that someone, somewhere has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search for an elegant solution** on the internet.

Maybe we are looking for a different type of solution. You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database. A database may be able to deliver the speeds you require: it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is. I am suggesting that a Python container implementation, like sortedcontainers but backed by a database (maybe Sqlite), may be a solution. Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations and made relatively insignificant. There is a delay in translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting.

Could you link to the repo where these tests are run? I seem to have lost the link in the long chain of emails on this list. I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all.

Thank you.

** The sortedcontainers library mentioned above is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.
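If it helps make the suggestion concrete, the kind of thing I have in mind might look roughly like the sketch below. The function, table, and column names are made up for illustration; a real container would hide this behind __sub__().

    import sqlite3

    # Rough sketch only: load both collections into SQLite tables and let the
    # database compute the difference with EXCEPT (rows of the first SELECT
    # that do not appear in the second, i.e. a - b).
    def sqlite_difference(a, b):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE a (v PRIMARY KEY)")
        db.execute("CREATE TABLE b (v PRIMARY KEY)")
        db.executemany("INSERT OR IGNORE INTO a VALUES (?)", ((v,) for v in a))
        db.executemany("INSERT OR IGNORE INTO b VALUES (?)", ((v,) for v in b))
        rows = db.execute("SELECT v FROM a EXCEPT SELECT v FROM b")
        return [v for (v,) in rows]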
Hi Kyle,

The repo is here: https://github.com/ponderpanda/listex

I'm still looking into sortedcontainers and Dominik's suggestion of numpy. Both seem promising. At the moment I'm more concerned with in-memory operations. Disk access complicates things, but I think it would be interesting to include it. We've seen how memory access times can turn superior algorithms into inferior ones at scale; it's very likely the same is true once disk access times are involved. I'm not sure anyone would want to wait for that benchmark to finish, though. My suspicion is that a hybrid approach would be optimal. It gets more complicated when you consider that you could throw in more computers to increase the available memory.

If you are going to use my routines for speed testing with large data sets, I would try simple timers instead of the benchmark suite. It takes much longer to run the tests repeatedly, and at that scale the timing precision is drowned out by the long run times.

I've been considering parallelization/distribution (sharing CPU and memory) and I think I have a pretty good idea of how to do it; I just haven't needed it for my work yet. I think the set style (hash tables) would be less likely to benefit from parallelization, but I haven't looked into it at all yet. Depending on the size of your datasets, you might benefit more from that than from a relational database.
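To illustrate what I mean by simple timers, something along these lines is enough at this scale. The data sizes are placeholders, and you would swap in whichever routine from the repo you want to compare.

    import time

    def time_once(label, fn, *args):
        # One wall-clock measurement; with runs this long, repetition adds
        # little precision and a lot of waiting.
        start = time.perf_counter()
        result = fn(*args)
        print(f"{label}: {time.perf_counter() - start:.3f} s")
        return result

    a = list(range(10_000_000))
    b = list(range(0, 10_000_000, 2))

    def set_difference(a, b):
        sb = set(b)
        return [x for x in a if x not in sb]

    time_once("set-based difference", set_difference, a, b)
    # time_once("listex difference", <routine from the repo>, a, b)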