On Sep 13, 2019, at 12:21, Richard Musil wrote:

On Fri, Sep 13, 2019 at 7:28 PM, Andrew Barnert via Python-ideas wrote:
First, I’m pretty sure that, contrary to your claims, C++ does not support this. C++ doesn’t even support shared memory out of the box. The third-party Boost library does provide it, as long as you only care about systems that correctly support POSIX shared memory, and Windows, and as long as you either …
I am not sure what your point is here. Shared memory support in C++ was not the OP's point.
The OP’s point wasn’t clear until three rounds of back and forth, but it’s pretty obvious now that what he wants is some kind of multiprocessing.SharedAtomic. Which means “you can do it in C++” isn’t as compelling an argument for adding anything to Python’s stdlib as it sounds, because it’s really “you can do it in C++, nonportably, with a third-party library”. There’s no reason someone can’t write a third-party library for Python and publish it on PyPI, and no indication that such a library would need any new support from the language or stdlib.
Yes, by using only atomics you cannot get the "frozen" global state, as the values will be changing behind your back, but again, the OP did not claim that this was his goal.
It’s not about “frozen”, it’s about _consistent_. That 1/0 example wasn’t just a theoretical fantasy; it’s a bug I created and then had to deal with in real-life C code. Atomically updating separate values cannot be composed into atomically updating a multiple-value state, so you have to carefully design everything to deal with that. Using a single lock is a whole lot simpler. And if the OP didn’t think about that issue (as most people don’t the first time they try to implement something like this, as I didn’t), then yeah, he wouldn’t ask for a solution to it.

Also, notice that you can already get the exact same behavior the OP asked for (albeit with different performance characteristics) with the default auto-constructed separate lock per value, and the OP didn’t know that. The answer to “I want a simpler way to do X” when there’s already a simpler way to do X is “here’s the existing simpler way to do X”, not designing and implementing a second almost-but-not-quite-as-simple way to do X.
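To make that concrete, here is a minimal Python sketch (hypothetical two-value stats, not the OP’s actual code) of that existing per-value-lock behavior, and of why per-value locking still isn’t consistent:

```python
import multiprocessing as mp

# With the default lock=True, each Value gets its own auto-constructed
# lock, so each individual update is safe across processes.
total = mp.Value('d', 0.0)
count = mp.Value('i', 0)

def update_stats(x):
    # += is a read-modify-write, so hold the per-value lock around it.
    with total.get_lock():
        total.value += x
    with count.get_lock():
        count.value += 1

def read_stats():
    # Each read is individually safe, but a reader can run between the
    # two updates above and see the new count with the old total: two
    # individually-safe updates do not compose into one consistent state.
    return total.value / count.value if count.value else 0.0
```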
By contrast, just grabbing a lock around every call to `update_stats` and `read_stats` is trivial.
It is trivial, and it can kill performance if there is a lot of contention.
And atomics for each value are trivial and can also kill performance, sometimes even worse, and can also be incorrect.
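For contrast, a sketch of the single-lock version (same hypothetical stats as above): one lock around the whole update and the whole read means readers always see a consistent pair, at the cost of serializing all access:

```python
import multiprocessing as mp

# One lock guards the whole group of stats; the Values are created
# with lock=False because `lock` protects them instead.
lock = mp.Lock()
total = mp.Value('d', 0.0, lock=False)
count = mp.Value('i', 0, lock=False)

def update_stats(x):
    with lock:
        total.value += x
        count.value += 1

def read_stats():
    with lock:
        # Both values are read under the same lock, so the snapshot
        # is consistent even while writers are running.
        return total.value / count.value if count.value else 0.0
```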
I can imagine that grabbing the lock can be as fast as an atomic when there is no contention (the lock being implemented with an atomic), but if there is contention, stopping the thread seems an order of magnitude more expensive than locking the bus for one atomic op.
For a single stat, an atomic update will almost certainly be faster. But the OP asked for multiple stats, and for multiple stats it can be slower, because you end up locking the bus and flushing the cache line multiple times instead of locking and flushing once while spin-, sleep-, or kernel-locking another thread. (And that’s before you add in the cost of whatever you end up needing to meet your consistency requirements.)
We may argue about some particular implementation and its memory access pattern (and memory partitioning and contention), but without knowing the details of the OP's app, we might just as well argue about the weather.
Sure, but we’re talking about changing Python. The default presumption is to not change it. If we have a scenario that might or might not benefit from a change depending on facts we don’t know about both the proposal and the scenario, we don’t just say “well, it’s possible it could help somehow, so let’s do it”.
Meanwhile, a single lock plus dozens of nonatomic writes is going to be a lot faster than dozens of atomic writes (especially if you don’t have a library flexible enough to let you use acquire-release semantics instead of fully atomic semantics), as well as being simpler. Not to mention that dozens of atomics allocated near each other may well mean stalling the cache five times for each contention or false contention instead of just once; with a lock you don’t even need to consider that, much less figure out how to test for it and fix it.
I would expect one would need quite a lot of atomics to equal one blocked thread, but that is just a gut feeling. Have you done any benchmarks on this?
On the OP’s code, of course not. On Python code using an as-yet-undesigned feature, again, no. But on a handful of different real-life server stats gathering systems over a few decades in C, I have. The details are obviously different for a 90s dual-core SMP machine (and code that has to gracefully degrade to single-core) vs. an iPhone 11, or for gathering stats for daily reporting vs. gathering stats for updating a load balancer every 2 seconds, and so on. All I can say is that in general, when you have enough contention that naive locking is too slow, you have enough contention that naive atomics are also too slow, and sometimes even worse. (Also, in at least one case, despite our assumption that shm was essential to performance, a UDP-based fallback was actually faster.) I’m sure there are exceptions to that which I never ran into. But the burden isn’t on me to prove that the OP’s unspecified use case isn’t one of those exceptions, it’s on the OP to show that it is, or at least might be.
When the lock isn’t acceptable, it’s because there’s too much contention on the lock, and there would have been too much contention on the atomic writes …
So we do not differ much in our understanding after all. You just assume that there won't be (a lot of) contention. I do not know; maybe the OP does.
No, I assume that if there _is_ a lot of contention, you need to solve it one level higher.
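One common shape of “one level higher”, as a sketch of my own rather than anything from the thread: batch updates in per-process local variables and only touch the shared lock occasionally, which divides lock contention by the batch size:

```python
import multiprocessing as mp

# Shared state, inherited by (or passed to) the worker processes.
lock = mp.Lock()
total = mp.Value('d', 0.0, lock=False)
count = mp.Value('i', 0, lock=False)

# Per-process local buffer (hypothetical batching scheme).
_local_total = 0.0
_local_count = 0

def update_stats(x, batch=100):
    global _local_total, _local_count
    _local_total += x
    _local_count += 1
    if _local_count >= batch:
        # Touch the shared lock once per `batch` updates instead of
        # once per update, cutting contention by that factor.
        with lock:
            total.value += _local_total
            count.value += _local_count
        _local_total = 0.0
        _local_count = 0
```

The tradeoff is staleness: a reader can lag behind by up to one unflushed batch per process, which is often acceptable for stats gathering and load-balancer-style reporting.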