Re: Support for atomic types in Python

Hi, I apologise for my poor explanation of what I meant; I should have been more comprehensive in describing the idea. First of all, I am concerned only with synchronization across multiple processes, not threads, which is why I never mentioned multi-threading in my original post. I am familiar with the GIL and understand that atomicity isn't an issue across threads. Also, I didn't mean that simple variables should be atomic. I mean that some objects which are frequently used across processes in multiprocessing could have some atomic operations/methods. Let's take the following example:

```
import time
from multiprocessing import Process, Value, Lock

def func(val, lock):
    for i in range(50):
        time.sleep(0.01)
        with lock:
            val.value += 1

if __name__ == '__main__':
    v = Value('i', 0)
    lock = Lock()
    procs = [Process(target=func, args=(v, lock)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(v.value)
```

In the above code snippet, I have to use a lock to prevent data races, and using a single lock is quite straightforward and simple. But in cases where I have to synchronize, let's say, 5 different values, I will have to use multiple locks, because using a single lock would be very slow: I would have to wait whenever the lock has been acquired to update any one of the 5 values. Also, as far as I know (I might be wrong), Value is stored in shared memory and is therefore very fast as well. So what I am proposing is an object similar to Value on which operations like add, sub, xor, etc. are atomic, so that I wouldn't have to use a lock for synchronization at all. Updates to these values can be made very easily by using calls such as __sync_and_and_fetch(), even when they are stored in a shared memory segment accessible across processes.

On 9 September 2019 at 9:59 AM, "Joao S. O. Bueno" <jsbueno@python.org.br> wrote: May I ask how acquainted you are with parallel code, including multi-threading, in Python? Because as far as I can perceive, none of the types or operations you mention have any race conditions in multi-threaded Python code, and for multi-process or async code that is not a problem by design, except for data designed to be shared, like memory-mapped stuff (and then locks are the only way to go anyway). Primitive data types like numbers and strings are immutable to start with, so it is "very impossible" for multi-threading code to be an issue when using them. As for composite native data types like dicts, lists and sets, which could theoretically be left in an inconsistent state, that is already taken care of by the GIL. And even for user-designed types, it is a matter of adding a threading lock inside the proper dunder methods, and that also fixes the issue for the users of those types. The only time I had to worry about a lock in a data type in Python (i.e., not a program-wide state that is inherent to my problem) was when asking a novice question on "how to rename a dict key" - and even then, it felt a bit overkill. In other words, I may be wrong, but I think you are talking about a non-issue here.
On Sun, 8 Sep 2019 at 11:33, Vinay Sharma via Python-ideas <python-ideas@python.org> wrote: Currently, C++ has support for atomic types, on which operations like add, sub, xor, etc. can be done atomically, thereby avoiding data races. Having such support would be very helpful in Python. For instance, users wouldn't have to use locks for synchronising shared variables in the case of multiprocessing; directly passing these atomic objects would be enough. These objects could have methods such as add, sub, etc. to change the underlying variable. I believe this would make Python much easier for users in both trivial and complicated cases, without having to use a lock. gcc also provides built-in support for some atomic operations; documentation is available at https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/_005f_005fsync-Builtins.html#g_... Moreover, gcc also has support for memory-model-aware atomic operations, documented at https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/_005f_005fatomic-Builtins.html#...
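For concreteness, here is a sketch of what the proposed API might look like. The AtomicValue name and its add() method are hypothetical - nothing like them exists in multiprocessing - and since the proposal isn't implemented, this stand-in emulates the atomic increment with the lock that Value already carries, purely so the sketch runs; the real proposal would back add() with a hardware atomic op such as gcc's __sync_add_and_fetch instead of any lock.

```
import time
from multiprocessing import Process, Value

class AtomicValue:
    """Hypothetical stand-in for the proposed atomic type.

    In the actual proposal, add() would be a single hardware atomic
    operation; here it is emulated with the RLock a synchronized
    Value carries by default, just to keep this example runnable.
    """
    def __init__(self, typecode, initial):
        self._v = Value(typecode, initial)  # lock=True (RLock) by default

    def add(self, n):
        with self._v.get_lock():  # stands in for one atomic instruction
            self._v.value += n

    @property
    def value(self):
        return self._v.value

def func(val):
    for i in range(50):
        time.sleep(0.01)
        val.add(1)  # no user-visible lock at the call site

if __name__ == '__main__':
    v = AtomicValue('i', 0)
    procs = [Process(target=func, args=(v,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(v.value)  # 500
```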

On Mon, Sep 9, 2019 at 12:34 AM Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
Currently, C++ has support for atomic types, on which operations like add, sub, xor, etc can be done atomically, thereby avoiding data races. Having such a support will be very helpful in Python.
On Mon, Sep 9, 2019 at 6:27 PM Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
First of all I am only concerned with synchronization across multiple processes only, and not threads. Therefore I never mentioned multi-threading in my original post.
How do C++ atomic types behave across processes? Can you elaborate on that, and how the concept would be applied to Python? ChrisA

On Sep 9, 2019, at 01:23, Vinay Sharma <vinay04sharma@icloud.com> wrote:
Also, I didn't mean simple variables to be atomic. I just mean that some objects which are frequently being used across processes in multiprocessing can have some atomic operations/methods.
It sounds like all you want is an automatically-synchronizing multiprocessing.Value, then? That has very little in common with C++ atomic, so let’s just forget that whole bit. Presumably you know that multiprocessing.Value already has a lock parameter, which does what you want for simple stores and loads. The only thing that’s missing is a way to lock combined operations, like +=. It sounds like you also want to be able to call update methods and operators directly on the Value rather than on its value property, but that actually makes it easier. You can build this yourself pretty easily. The cleanest way is probably to delegate to a non-synchronized Value. Something like this:

```
import multiprocessing as mp

class SyncValue:
    def __init__(self, typ, value, lock=None):
        if lock is None:
            lock = mp.RLock()
        # non-synchronized Value; SyncValue's own lock guards all access
        self._value = mp.Value(typ, value, lock=False)
        self._lock = lock

    @property
    def value(self):
        with self._lock:
            return self._value.value
    # etc.

    def __iadd__(self, other):
        with self._lock:
            self._value.value += other
        return self
    # likewise for other operators and methods
```

And now your code becomes:

```
import time
from multiprocessing import Process
from mpsyncvalue import SyncValue

def func(val):
    for i in range(50):
        time.sleep(0.01)
        val += 1  # goes through SyncValue.__iadd__, under the lock

if __name__ == '__main__':
    v = SyncValue('i', 0)
    procs = [Process(target=func, args=(v,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(v.value)
```

I don’t think this is a great design, but if you want it, build it, share it on PyPI, and if other people like it and want it added to the stdlib, then it doesn’t really matter what I think. I suspect you’re going to want a few iterations on the API (and on working out exactly what’s needed for unit tests) before it’s ready for the stdlib anyway. (Remember, once it’s in the stdlib, it’s a lot harder to change the API–and you have to wait a year and a half to use the change, and often even longer to force all of your users/deployments to upgrade.)

On Mon, Sep 9, 2019, 09:27 Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
So, as Andrew suggested, you are talking about synchronization between processes on access to shared data, right? [snip]
It isn't clear how a single lock wouldn't help. Could you provide an example of the more complex case? How would "atomic" multi-proc types help?

> Also, as far as I know (might be wrong) Value is stored in shared memory

Wouldn't such a type just use locks behind the scenes?

> Updates to these values can be made very easily by using calls such as __sync_and_and_fetch(), even when they are stored in a shared memory segment, accessible across processes.

So you want a concurrency-safe wrapper around multiple operations? I'm unclear why this can't just be done with locks. -eric

Hi Vinay, On Mon, 09 Sep 2019 08:23:48 -0000 Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
Also, as far as I know (might be wrong) Value is stored in shared memory and is therefore very fast also. So, what I am proposing is a similar object to value to which operations like add, sub, xor, etc are atomic. Therefore, I wouldn't have to use lock at all for synchronization. Updates to these values can be made very easily by using calls such as __sync_and_and_fetch(), even when they are stored in a shared memory segment, accessible across processes.
I'm curious what your use case is. I have never seen multiprocess Python programs operate at this low level of concurrency (though it's generally possible). The common model when using multiprocessing is to dispatch large tasks and let each worker grind on a task on its own. If you spend your time synchronizing access to shared memory (to the point that you need a separate lock per shared variable), perhaps your model of sharing work between processes is not right. In any case, providing C++-like atomic types sounds very low-level in a language like Python. I'm skeptical that it would address more than a couple extremely niche use cases, though I'd be happy to be proven wrong. Regards Antoine.

Hi Antoine, As you said, there can be a lot of possible use cases for the proposed feature, since there are lots of use cases for a lock. I can tell you some cases which I am facing.

Let's say I have a parent process which spawns lots of worker processes to serve some requests. Now the parent process wants to know statistics about the kind of requests being served by these workers: for example, the average time to serve a request by all workers combined, the number of bad requests, the number of stalled requests, the number of rerouted requests, etc. The worker processes will make updates to these variables, which the parent can report, and accordingly adjust the workers. Instead of locking, it would be much more helpful and easier to use atomic values. The workers could instead send their data to the parent and let the parent do the writing and synchronising, but that wouldn't be very fast; writing directly to shared memory is much faster, since Value is stored in shared memory.

Another use case I can think of is reference counting, where the parent allocates a shared resource, let's say shared memory, and wants to deallocate it when none of the worker processes are using it. This can be implemented by storing a reference count for the shared memory and cleaning it up when the reference count becomes 0. Each such shared resource has to be coupled with a reference-count variable, and each reference-count variable needs a lock with it to prevent data races. Making this reference-count variable atomic would avoid having to pass locks around with these variables. The GIL prevents this scenario in the case of multithreading, but it very much arises in the case of multiprocessing (although the type of shared resource changes in the multiprocessing case).
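For concreteness, a minimal sketch of the reference-counting case as it must be written with today's API (the names are illustrative): every decrement happens under the Value's built-in lock, which is exactly the ceremony the proposed atomic type aims to remove.

```
from multiprocessing import Process, Value

def worker(refcount):
    try:
        pass  # ... use the shared resource here ...
    finally:
        with refcount.get_lock():   # today, every decrement needs the lock
            refcount.value -= 1
            last = (refcount.value == 0)
        if last:
            print('last user exiting: release the shared resource here')

if __name__ == '__main__':
    refcount = Value('i', 4)        # resource handed to 4 workers
    procs = [Process(target=worker, args=(refcount,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```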

Hi Vinay, I see. I agree aggregating statistics is a reasonable use case. Still, maintaining atomic types in the standard library sounds non-trivial for a rather niche feature. Perhaps you want to implement them as a third-party project and publish it on PyPI? multiprocessing already provides shared memory facilities, so you can build on that. Regards Antoine.

On 13/09/2019 14:32, Vinay Sharma via Python-ideas wrote:
I don't in principle object to having language support for tricky operations, but I'm a bit concerned that you haven't really defined what atomic types are. I'm worried that you are trying to make something that is still hard look easy, and a lot of naive users will fall for it. I have much the same concern about atomic simple types in C, if it's any comfort.
It could be. It could also be completely the wrong answer if you care about having a consistent set of variables for any given task, or a consistent set of variables across all tasks, or any number of other possibilities. This is the kind of attractive cock-up that worries me, I must admit. -- Rhodri James *-* Kynesim Ltd

On 13/09/2019 17:07, Vinay Sharma wrote:
Please see my second mail in this thread, where I have comprehensively explained what I mean by atomic types. It is nothing but multiprocessing.Value, which can have an Atomic variant to be used across processes.
Doesn't multiprocessing.Value already have a lock option for this exact purpose? How else would your atomic access work? -- Rhodri James *-* Kynesim Ltd

multiprocessing.Value can be synchronised using a lock, but if I have multiple multiprocessing.Value(s) which I want to synchronise between two processes, then I will have to pass a lock for each multiprocessing.Value. Therefore having a multiprocessing.AtomicValue could prove handy in these cases. I have also described some practical use cases wherein I might need to synchronise multiple values between processes.

On 13/09/2019 17:31, Vinay Sharma wrote:
multiprocessing.Value can be synchronised using a lock, but if I have multiple multiprocessing.Value(s) which I want to synchronise between two processes, then I will have to pass a lock for each multiprocessing.Value. Therefore having a multiprocessing.AtomicValue could prove handy in these cases.
I repeat, how does this work? If you want atomicity across processes, you need some kind of lock at some level. -- Rhodri James *-* Kynesim Ltd

If you see my first mail, I mentioned that gcc has built-in support for atomic operations, which can be used to implement atomic behaviour. You can refer to gcc's documentation at https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/_005f_005fsync-Builtins.html#g_...

On Fri, Sep 13, 2019 at 6:56 PM Rhodri James <rhodri@kynesim.co.uk> wrote:
I repeat, how does this work? If you want atomicity across processes, you need some kind of lock at some level.
For the atomic operations, the lock is at the hardware level (i.e. implemented in the CPU, across the memory bus, cache, etc.); from the user perspective, it just looks like executing a particular (atomic) op-code (https://software.intel.com/en-us/node/506090). Compared to that, the "Lock" is an OS implementation (which possibly uses the hardware atomic operations internally), but is far heavier, because it can block the thread. An atomic operation cannot. Richard

On Sep 13, 2019, at 10:11, Richard Musil <risa2000x@gmail.com> wrote:
For the atomic operations, the lock is at the hardware level (i.e. implemented in the CPU, across the memory bus, cache, etc.); from the user perspective, it just looks like executing a particular (atomic) op-code (https://software.intel.com/en-us/node/506090).
Compared to that, the "Lock" is an OS implementation (which possibly uses the hardware atomic operations internally), but is far heavier, because it can block the thread. An atomic operation cannot.
I think you’re mixing up two levels here. A lock doesn’t have to just be a kernel lock object, and usually isn’t—usually you’d implement it with an atomic counter, and then only fall back to the kernel lock when you detect contention (the atomic compare-exchange failed, or compare-increment returned 2 instead of 1). And you may even sleep-lock or spin-lock a few times before falling back to the kernel. So, when there’s not high contention, it’s not actually heavy. It is, however, obviously not lock-free. If you’re building systems that need to guarantee progress for all threads at all times, locks make that difficult. But the tradeoff is that locks are a lot easier to tune, and to compose, than atomic operations. For example, if you want to keep a dozen values in sync, so everyone sees either version N of all of them or version N+1 of all of them, that’s trivial with locks (just grab a single lock around all dozen updates), but a big pain, and very easy to get wrong, with atomic operations. Of course even locks aren’t friendly to compose, just less unfriendly than atomics. For an extreme (but not uncommon) example, let’s say you also need to guarantee that a failing process can only update all 12 values or none of them. But the proposed mp.AtomicValue wouldn’t help there either.

On Sep 13, 2019, at 09:31, Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
multiprocessing.Value can be synchronised using a lock, but if I have multiple multiprocessing.Value(s) which I want to synchronise between two processes, then I will have to pass a lock for each multiprocessing.Value.
This isn’t true. You can use the same lock for all of your values. Or you can not pass a lock explicitly and let the mp library create them so you don’t have to worry about it at all.
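To illustrate both options, a minimal sketch with invented stat names: several Values sharing one explicit lock (so a group of updates is seen consistently), and a Value relying on the lock multiprocessing creates for it by default. Note the shared lock must be an RLock, since accessing .value on a synchronized Value re-acquires its lock internally.

```
from multiprocessing import RLock, Value

# Option 1: one explicit RLock shared by several Values; a single
# acquisition covers a whole group of updates.
stats_lock = RLock()
served  = Value('i', 0, lock=stats_lock)
bad     = Value('i', 0, lock=stats_lock)
stalled = Value('i', 0, lock=stats_lock)

def record_bad_request():
    with stats_lock:            # readers see served and bad move together
        served.value += 1
        bad.value += 1

# Option 2: pass no lock and let multiprocessing create one per Value,
# reachable through get_lock().
rerouted = Value('i', 0)        # lock=True is the default

def record_reroute():
    with rerouted.get_lock():
        rerouted.value += 1
```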

On Fri, Sep 13, 2019 at 6:27 PM Rhodri James <rhodri@kynesim.co.uk> wrote:
My summary from the thread: The "atomicity" the OP asks for is an equivalent of the C++ <atomic> abstraction (https://en.cppreference.com/w/cpp/header/atomic), which eventually boils down to the platform hardware support. In other words, it is really a hardware feature. The other consequence is that it only makes sense on "l-values" (C/C++), so the natural candidate for implementing this feature would be the ctypes module. As the OP already suggested, there are actually already ctypes in multiprocessing.sharedctypes which almost do that, but not quite: they use a lock for the synchronization, which is what the OP wants to avoid (there is a significant difference between a lock and an atomic operation). A natural way to proceed would be extending multiprocessing.sharedctypes with 'AtomicValue', 'AtomicArray', etc., which would be the atomic counterparts of the objects already implemented in sharedctypes, i.e. objects for which access is handled by platform atomic operations instead of non-atomic (standard) ones. On a side note, when reading the docs, I am not sure I understand what is actually the difference between the objects from multiprocessing and multiprocessing.sharedctypes. Richard

First, I’m pretty sure that, contrary to your claims, C++ does not support this. C++ doesn’t even support shared memory out of the box. The third-party Boost library does provide it—as long as you only care about systems that correctly support POSIX shared memory, plus Windows, and as long as you don’t care that your shared memory might be simulated with a memory-mapped file, or that it might have different persistence than you asked for. And then, whether you can actually allocate atomics inside shared memory and have them actually be atomic depends on whether atomic<T> for every one of your types is guaranteed lock-free, as opposed to just usually lock-free, which you have to test for either per platform/compiler at build time, or at runtime. For most applications, that’s all good enough. But if requiring a not-100%-portable third-party library for both building and runtime, and writing an autoconf test to boot, is good enough for C++, why does your solution need to be in the stdlib rather than a PyPI library in Python? Meanwhile:
It definitely will not make it easier. Using atomics means that your statistics can be out of sync. You could, for example, see up-to-the-nanosecond bad requests count but out-of-date total requests and therefore calculate an incorrect request success rate (for an extreme example, it could be 1/0) and do the wrong thing. You can fix this by coming up with an ordering discipline for updates, and adding an extra value for total updates that you can use to manage the potential error bounds, but this is hardly simple. By contrast, just grabbing a lock around every call to `update_stats` and `read_stats` is trivial.
The workers can also send data to parent, and then the parent will have the overhead of writing and synchronising, but this wouldn’t be very fast.
If the parent is the only writer or user of the data, why does the parent have any overhead for synchronizing? You can just use non-atomic writes, which are much faster, and there are fewer writes too. Of course sending every update over a queue _is_ slow, almost certainly costing a lot more than what you save on not needing locks or atomics, so the overall cost is higher. But not for the reason you seem to think, and if you don’t know where the performance costs actually are, I’m not sure you’re in the right place to start micro-optimizing here.

Meanwhile, a single lock plus dozens of non-atomic writes is going to be a lot faster than dozens of atomic writes (especially if you don’t have a library flexible enough to let you do acquire-release semantics instead of fully-atomic semantics), as well as being simpler. Not to mention that having dozens of atomics likely to be allocated near each other may well mean stalling the cache 5 times for each contention or false contention instead of just once, but with a lock you don’t even need to consider that, much less figure out how to test for it and fix it.

When the lock isn’t acceptable, it’s because there’s too much contention on the lock—and there would have been too much contention on the atomic writes as well, so that isn’t the solution. There are optimizations you can get from double-buffering the stats pages, using platform-specific calls to atomically sync a page at a time rather than a value at a time, etc., but ultimately you have to reduce that contention. The traditional answer to this is to have multiple shm segments and have the parent (or a separate stats daemon) aggregate them. The simplest version is to take this all the way to a single shm segment per child. Now you just need a single atomic counter per dump rather than each value being atomic. This is especially nice for a stats use case, where the child may be updating stats hundreds of times per second but the collector is only checking each child every few seconds. (If you want a shiny modern solution instead, this looks like one of the few cases where hardware TM support can probably speed things up relatively easily, which would be a fun project to work on…)
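A rough sketch of the "one segment per child" shape described above, scaled down to plain Values (the worker count and two-second polling interval are just illustrative): each worker owns its own counter, so writes never contend across processes, and the parent aggregates on its own schedule.

```
import time
from multiprocessing import Process, Value

def worker(my_count):
    for _ in range(500):
        time.sleep(0.01)
        with my_count.get_lock():   # uncontended: only this child writes it
            my_count.value += 1

if __name__ == '__main__':
    counts = [Value('i', 0) for _ in range(4)]   # one counter per child
    procs = [Process(target=worker, args=(c,)) for c in counts]
    for p in procs:
        p.start()
    while any(p.is_alive() for p in procs):
        time.sleep(2)   # collector samples every few seconds;
        # slightly stale reads are fine for statistics
        print('total so far:', sum(c.value for c in counts))
    for p in procs:
        p.join()
    print('final total:', sum(c.value for c in counts))
```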

On Fri, 13 Sep 2019 10:27:10 -0700 Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
(If you want a shiny modern solution instead, this looks like one of the few cases where hardware TM support can probably speed things up relatively easily, which would be a fun project to work on…)
Hardware transactional memory, at least the variant implemented by Intel, seems to be largely a disappointment, and little software, if any, takes advantage of it AFAIK. Regards Antoine.

On Sep 13, 2019, at 10:42, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes, that’s why I said a stats collector is one of the rare cases where it could actually be useful. There aren’t that many cases where it helps, and trying to build higher-level abstractions around it means you end up with something that doesn’t even help most of those few cases.

Am I the only one seeing two copies of each of Andrew Barnert's replies? -- Steven

On Fri, Sep 13, 2019 at 7:28 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
I am not sure what your point is here. Shared memory support in C++ was not the OP's point. <atomic> support in C++ has been there since C++11. I do not know if historically it was always supported, but a simple smoke test on gcc on Linux and MSVC on Windows, compiling:

```
#include <atomic>

int main(int argc, char** argv) {
    std::atomic<int> a(0);
    a++;
    return a;
}
```

gives the expected result - both compilers emit "lock xadd" into the code without any additional effort (gcc with --std=c++11 and MSVC with /std:c++14, since it does not know c++11).
Yes, by using only atomics you cannot get the "frozen" global state, as the values will be changing behind your back, but again, the OP did not claim that this was his goal.
By contrast, just grabbing a lock around every call to `update_stats` and `read_stats` is trivial.
It is trivial, and it can kill the performance if there is a lot of contention. I can imagine that grabbing the lock can be as fast as an atomic when there is no contention (the lock being implemented by an atomic), but if there is contention, stopping the thread seems an order of magnitude more expensive than locking the bus for one atomic op. We may argue about some particular implementation and its memory access pattern (and memory partitioning and contention), but I guess without knowing all the details about the OP's app, we can just as well argue about the weather.
Not to mention that having dozens of atomics likely to be allocated near each other may well mean stalling the cache 5 times for each contention or false contention instead of just once
I would expect one would need quite a lot of atomics to equal one blocked thread, but it is just a gut feeling. Did you do any benchmark in this matter?
So we do not differ much in our understanding after all. You just assume that there won't be a (lot of) contention. I do not know. Maybe the OP does. Richard

The OP’s point wasn’t clear until three rounds of back and forth, but it’s pretty obvious now that what he wants is some kind of multiprocessing.SharedAtomic. Which means the argument that “you can do it in C++” isn’t as compelling an argument for adding anything to Python’s stdlib as it sounds, because it’s really “you can do it in C++ nonportably with a third-party library”. There’s no reason someone can’t go write a third-party library for Python and publish it on PyPI, and no indication that such a library would need any new support from the language or stdlib.
Yes, by using only atomics you cannot get the "frozen" global state, as the values will be changing behind your back, but again, the OP did not claim that this was his goal.
It’s not about “frozen”, it’s about _consistent_. That 1/0 example wasn’t just a theoretical fantasy, it’s a bug I created and then had to deal with in real life C code. Atomically updating separate values cannot be composed into atomically updating a multiple-value state, so you have to carefully design everything to deal with that. Using a single lock is a whole lot simpler. And if the OP didn’t think about that issue (as most people don’t the first time they try to implement something like this, as I didn’t), then yeah, he wouldn’t ask for a solution to it. Also, notice that you can already get the exact same behavior (albeit different performance characteristics) that the OP asked for, with the default auto-constructed separate lock per value, and the OP didn’t know that. The answer to “I want a simpler way to do X” when there’s already a simpler way to do X is “Here’s the existing simpler way to do X”, not designing and implementing a second almost-but-not-quite-as-simple way to do X.
By contrast, just grabbing a lock around every call to `update_stats` and `read_stats` is trivial.
It is trivial and can kill the performance if there is a lot of contention.
And atomics for each value are trivial and can also kill the performance, sometimes even more badly, and can also be incorrect.
I can imagine that grabbing the lock can be as fast as an atomic when there is no contention (the lock being implemented by an atomic), but if there is contention, stopping the thread seems an order of magnitude more expensive than locking the bus for one atomic op.
For a single stat, an atomic update will almost certainly be faster. But the OP asked for multiple stats. And for multiple stats, it can be slower, because you end up locking the bus and flushing the cache line multiple times instead of locking and flushing once while spin- or sleep- or kernel-locking another thread. (And that’s even before you add in any costs or whatever you end up having to add in for whatever your consistency requirements are.)
We may argue about some particular implementation and its memory access pattern (and memory partitioning and contention), but I guess without knowing all the details about the OP's app, we can just as well argue about the weather.
Sure, but we’re talking about changing Python. The default presumption is to not change it. If we have a scenario that might or might not benefit from a change depending on facts we don’t know about both the proposal and the scenario, we don’t just say “well, it’s possible it could help somehow, so let’s do it”.
On the OP’s code, of course not. On Python code using an as-yet-undesigned feature, again, no. But on a handful of different real-life server stats gathering systems over a few decades in C, I have. The details are obviously different for a 90s dual-core SMP machine (and code that has to gracefully degrade to single-core) vs. an iPhone 11, or for gathering stats for daily reporting vs. gathering stats for updating a load balancer every 2 seconds, and so on. All I can say is that in general, when you have enough contention that naive locking is too slow, you have enough contention that naive atomics are also too slow, and sometimes even worse. (Also, in at least one case, despite our assumption that shm was essential to performance, a UDP-based fallback was actually faster.) I’m sure there are exceptions to that which I never ran into. But the burden isn’t on me to prove that the OP’s unspecified use case isn’t one of those exceptions, it’s on the OP to show that it is, or at least might be.
When the lock isn’t acceptable, it’s because there’s too much contention on the lock—and there would have been too much contention on the atomic writes
So we do not differ much in our understanding after all. You just assume that there won't be a (lot of) contention. I do not know. Maybe the OP does.
No, I assume that if there _is_ a lot of contention, you need to solve it one level higher.

participants (10)
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- Dan Sommers
- Eric Snow
- Joao S. O. Bueno
- Rhodri James
- Richard Musil
- Steven D'Aprano
- Vinay Sharma