How to prevent shared memory from being corrupted?

Problem: Currently, let's say I create a shared memory segment using multiprocessing.shared_memory.SharedMemory <https://docs.python.org/3.10/library/multiprocessing.shared_memory.html> in Process 1 and open the same segment in Process 2. Then I try to write some data to the shared memory segment from both processes. To prevent a race condition (data corruption), either these writes must be atomic, or I must be able to lock/unlock the shared memory segment, which I cannot do at the moment. I earlier posted a solution <https://mail.python.org/archives/list/python-ideas@python.org/thread/X4AKFFM...> to this problem, which received a positive response, but there weren't many replies, despite the fact that this problem makes shared_memory practically unusable when there are simultaneous writes. So, the purpose of this post is to discuss solutions.

Some solutions:

1. Support shared semaphores across unrelated processes to lock/unlock the shared memory segment. --> More details: <https://mail.python.org/archives/list/python-ideas@python.org/thread/X4AKFFM...>

2. Let the first bit of the shared memory segment be a synchronisation bit used for locking/unlocking. A process can lock the segment only if this bit is unset; it sets the bit on acquiring the lock and unsets it on unlocking. The set/unset operations must be atomic, so the following GCC built-ins could be used:

   type __sync_add_and_fetch (type *ptr, type value, ...)
   type __sync_sub_and_fetch (type *ptr, type value, ...)
   type __sync_or_and_fetch (type *ptr, type value, ...)

   Documentation of these can be found at the links below:
   https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005...
   https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_0...

Any other ideas/solutions are very welcome.
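The race described above, and the workaround available today, can be sketched as follows. This is a minimal illustration, not a proposed API: the counter lives in a SharedMemory segment, and a separate multiprocessing.Lock guards the read-modify-write (the helper names `_worker` and `run_shm_counter` are ours).

```python
# Two processes increment a counter stored in a SharedMemory segment.
# Without the lock, the read-modify-write steps can interleave and
# increments are lost; the lock (which lives outside the segment, which
# is exactly the limitation discussed above) keeps the count correct.
import struct
from multiprocessing import Lock, Process
from multiprocessing import shared_memory

def _worker(name, lock, iterations):
    shm = shared_memory.SharedMemory(name=name)
    for _ in range(iterations):
        with lock:  # serialize the read + write; without this, updates race
            (value,) = struct.unpack_from("q", shm.buf, 0)
            struct.pack_into("q", shm.buf, 0, value + 1)
    shm.close()

def run_shm_counter(nprocs=2, iterations=1000):
    shm = shared_memory.SharedMemory(create=True, size=8)
    struct.pack_into("q", shm.buf, 0, 0)
    lock = Lock()
    procs = [Process(target=_worker, args=(shm.name, lock, iterations))
             for _ in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    (total,) = struct.unpack_from("q", shm.buf, 0)
    shm.close()
    shm.unlink()
    return total

if __name__ == "__main__":
    print(run_shm_counter())  # 2000: no updates lost while the lock is held
```

Note that this only works because both processes were handed the same Lock object by a common parent; for genuinely unrelated processes there is no stdlib way to share the lock, which is the gap the proposals above try to fill.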

On Sun, 26 Jul 2020 at 19:11, Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
One thing that is worth thinking about is the safety of the API that is put together. A memory segment plus a separate detached semaphore or mutex can be used to build a safe API, but is not itself a safe API. A safe API shouldn't allow writes to the memory segment while the mutex is unlocked, rather than allowing one to build a safe API from the various pieces. (There may/will be lower-level primitives that are unsafe.)

We can look at a lot of the APIs in the Rust community for examples of this sort of thing. Python doesn't have the borrow checker to enforce usage, but we could still work from the same basic principle, given there are multiple processes involved, to make it easier to have safe outcomes.

For instance, we could have an object representing a memory range that doesn't offer read/write at all, but allows:
- either one process write access over the range,
- or any number of readers read access over the range,
- subdividing the range (so that you can e.g. end one write lock and keep another).

For instance, https://doc.rust-lang.org/std/vec/struct.Vec.html#method.split_at_mut is an in-process API that is very similar.

-Rob
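A rough in-process analogue of the range-splitting idea already exists in Python: memoryview slices give disjoint writable views over one buffer, each of which can be handed to a different "owner", much like Rust's split_at_mut. This only illustrates the shape of such an API; it does not enforce anything across processes.

```python
# Subdivide one buffer into disjoint writable ranges via memoryview slices.
buf = bytearray(8)
view = memoryview(buf)
left, right = view[:4], view[4:]   # split the range in two

left[:] = b"\x01\x01\x01\x01"      # "writer 1" touches only its half
right[:] = b"\x02\x02\x02\x02"     # "writer 2" touches only its half

assert bytes(buf) == b"\x01\x01\x01\x01\x02\x02\x02\x02"
```

The same slicing works on `SharedMemory.buf`, so disjoint ranges of a segment can already be written without overlapping; what is missing is anything that prevents two processes from taking overlapping views.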

Hi, Thanks for replying.
Agreed. That’s why I am more inclined to the second solution that I mentioned.
Where will this memory object be stored? Locking a particular range instead of the whole memory segment will be relatively efficient, because processes using different ranges can write simultaneously. Since this object will also be shared across multiple processes, there must be a safe way to update it. Any thoughts on that?

On Mon, 27 Jul 2020 at 23:24, Vinay Sharma <vinay04sharma@icloud.com> wrote:
The second approach isn't clearly specified yet: is 'sync' in the name implying a mutex, an RW lock, or dependence on pointers to atomic types (which then becomes a portability challenge in some cases)? The C++ atomics documentation you linked to documents a similar but differently named set of methods, so you'll need to clarify the difference you intend.
There are a few options. The most obvious one given that bookkeeping data is required, is to build a separate layer offering this functionality, which uses the now batteries-included SHM facilities as part of its implementation, but doesn't directly surface it.
> Locking a particular range instead of the whole memory segment will be relatively efficient because processes using different ranges can write simultaneously.
> Since this object will also be shared across multiple processes, there must be a safe way to update it.
There's a lot of prior art on named locks of various sorts; I'd personally be inclined to give these things a name that can be used across different processes in some form, and bootstrap from there.
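One concrete form of the "named lock" idea: on POSIX systems, unrelated processes can rendezvous on a well-known filesystem path and take an advisory lock on it with flock(2). The class name and the path convention below are made up for illustration; this is a POSIX-only sketch, not an existing API.

```python
# A named exclusive lock that any process can open by name, built on
# POSIX advisory file locks. Two processes constructing NamedLock("x")
# with the same directory contend on the same underlying lock.
import fcntl
import os

class NamedLock:
    """Exclusive cross-process lock identified by a name (POSIX-only)."""

    def __init__(self, name, directory="/tmp"):
        # Rendezvous point: a small lock file derived from the name.
        self._fd = os.open(os.path.join(directory, name + ".lock"),
                           os.O_CREAT | os.O_RDWR, 0o600)

    def __enter__(self):
        fcntl.flock(self._fd, fcntl.LOCK_EX)   # blocks until acquired
        return self

    def __exit__(self, *exc):
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        return False

    def close(self):
        os.close(self._fd)
```

This sidesteps the bootstrap problem (no parent process needed to hand out the lock object), at the cost of tying lock identity to a filesystem path; POSIX named semaphores (sem_open) are the other classic option, but the stdlib does not expose them.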

Python has support for atomic types, I guess:

Atomic Int: https://github.com/python/cpython/blob/master/Include/internal/pycore_atomic...
Atomic Store: https://github.com/python/cpython/blob/master/Include/internal/pycore_atomic...

These methods don't use any locks; they are just atomic operations. So, my approach was to lock the whole shared memory segment at once. To do that, we can store an integer at the beginning of every shared memory segment which denotes whether the segment is locked (1) or unlocked (0), and atomic operations can be used to update this integer: (0 -> 1) lock, (1 -> 0) unlock. A `wait` function will also have to be implemented, like in semaphores, which waits until the segment is free (becomes 0).
Can you please elaborate more on this? I understand that shared memory will be used to store ranges and whether they are locked/unlocked, etc. But if multiple processes can update this data, then we will also have to think about synchronising this book-keeping data.

So, I guess you mean that all processes will be allotted shared memory through a separate API/layer which takes care of the book-keeping, and since only this layer is responsible for the book-keeping, there will be no need to synchronise the book-keeping data. But then the question arises: how will unrelated processes communicate with this layer/API to request shared memory? One way could be to create a separate process that manages the book-keeping, with the other processes requesting access/lock/unlock through it; the communication between this layer (a separate process) and the processes using shared memory would then use some form of IPC.
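The broker-process idea sketched above is close to what the stdlib's multiprocessing.managers already supports: one process runs a small server holding the book-keeping objects, and unrelated processes connect to it by address and authkey over a socket. The names `LockManager`, `get_lock`, and the demo functions below are illustrative, not an existing API.

```python
# A manager process serves a Lock; any process that knows the address
# and authkey can connect and use it via a proxy, so no common parent
# is required.
import threading
from multiprocessing.managers import BaseManager

_lock = threading.Lock()   # lives in the server process

def _get_lock():
    return _lock

class LockManager(BaseManager):
    pass

LockManager.register("get_lock", callable=_get_lock)

def serve():
    # Server side: start the manager in its own process; port 0 lets the
    # OS pick a free port, readable afterwards as mgr.address.
    mgr = LockManager(address=("127.0.0.1", 0), authkey=b"shm-demo")
    mgr.start()
    return mgr

def client_acquire_release(address):
    # Client side: an unrelated process would do exactly this.
    client = LockManager(address=address, authkey=b"shm-demo")
    client.connect()
    lock = client.get_lock()   # proxy to the Lock in the server process
    lock.acquire()
    lock.release()
    return True
```

The trade-off is exactly the one raised above: every lock/unlock is a round-trip of IPC through the broker, which is much slower than an atomic operation on the segment itself, but the book-keeping data only ever lives in one process and needs no synchronisation.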

Surely you need locks and semaphores that work between processes? Both Unix and Windows have these primitives. Atomics are great for lockless changes to single ints, but anything more complex needs locks and semaphores. Surely you do not want to be implementing your own locks when the OS support works well with OS scheduling?

Barry

On Thu, 30 Jul 2020 at 12:57, Vinay Sharma via Python-ideas <python-ideas@python.org> wrote:
You could also use immutables: https://nextjournal.com/schmudde/adventures-in-immutable-python

You don't need locks with immutable objects. Since they're immutable, any operation that would usually mutate the object generates another immutable object instead. The most common example is str: the sum of two strings in Python (and in many other languages) produces a new string. This is usually slower than modifying a mutable object (such as the atomic types), but it lets you remove the bottleneck of a lock.

See also immutables.Map: https://github.com/MagicStack/immutables
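The point above in miniature: "modifying" an immutable produces a brand-new object, so a reader holding the old reference can never observe a half-finished write.

```python
# str and tuple are immutable: operations that look like mutation
# actually build new objects and leave the originals untouched.
a = "hello"
b = a + " world"          # a new str; `a` is unchanged
assert a == "hello" and b == "hello world"
assert a is not b

t = (1, 2)
u = t + (3,)              # likewise for tuples
assert t == (1, 2) and u == (1, 2, 3)
```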

PyArrow Plasma object ids; "sealing" makes an object immutable. See also pyrsistent.
https://arrow.apache.org/docs/python/plasma.html#object-ids
https://arrow.apache.org/docs/python/plasma.html#creating-an-object-buffer
https://pypi.org/project/pyrsistent/ also supports immutable structures On Sat, Aug 1, 2020 at 4:44 PM Eric V. Smith <eric@trueblade.com> wrote:

https://docs.dask.org/en/latest/shared.html#known-limitations : the central process, which can be expensive
- The multiprocessing scheduler cannot transfer data directly between worker processes; all data routes through the master process.
... https://distributed.dask.org/en/latest/memory.html#difference-with-dask-comp... (... https://github.com/dask/dask-labextension ) On Sat, Aug 1, 2020 at 7:34 PM Wes Turner <wes.turner@gmail.com> wrote:

I understand that I won't need locks with immutable objects at some level, but I don't understand how they can be used to synchronise shared memory segments. For every change in an immutable object, a copy is created which has a different address. Now, for processes to use this updated object, they will have to map the new address into their address space to see any changes, and this remapping would have to occur whenever a change takes place, which is obviously not feasible. So changes to the shared memory segment should be made in the segment itself; therefore shared memory segments should be mutable.

It's best to avoid those synchronization barriers if possible. If you have all of the data in SHM (RAM) on one node, and you need to notify processes / wait for other workers to be available to perform a task that requires that data, you need a method for IPC: a queue, channel subscriptions, a source/sink, or over-frequent polling that's more resilient against dropped messages. (But you only need to scale to one node.)

There needs to be a shared structure that tracks allocations, right? What does it need to do lookups by?

  [ [obj_id_or_shm_pointer, [subscribers]] ]

Does the existing memory pool solve for that? And there also needs to be an instruction pipeline: a queue/channel/source of messages for each worker, or only some workers, to process.

...
https://distributed.dask.org/en/latest/journey.html
https://distributed.dask.org/en/latest/work-stealing.html

"Accelerate intra-node IPC with shared memory" https://github.com/dask/dask/issues/6267

On Sun, Aug 2, 2020, 3:21 AM Vinay Sharma <vinay04sharma@icloud.com> wrote:

There's also the possibility to use shared ctypes: https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects — operations like += which involve a read and a write are not atomic. So if, [...]
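The shared-ctypes point above can be shown concretely: multiprocessing.Value allocates the int in shared memory and carries its own lock, and `+=` must run under that lock precisely because the read and the write are separate steps. A minimal sketch (the helper names `_add` and `run_shared_value` are ours):

```python
# Two processes increment a shared ctypes int. Guarding `+=` with
# counter.get_lock() makes the read-then-write sequence atomic with
# respect to the other process.
from multiprocessing import Process, Value

def _add(counter, n):
    for _ in range(n):
        with counter.get_lock():   # += is read-then-write: guard it
            counter.value += 1

def run_shared_value(nprocs=2, iterations=1000):
    counter = Value("i", 0)        # lock=True is the default
    procs = [Process(target=_add, args=(counter, iterations))
             for _ in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value

if __name__ == "__main__":
    print(run_shared_value())  # 2000
```

Like the Lock-plus-SharedMemory workaround, though, this only works for processes that inherit the Value from a common parent; it does not help unrelated processes.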
On Sat, 1 Aug 2020 at 22:42, Eric V. Smith <eric@trueblade.com> wrote:
This is interesting. What if you want to have a language that uses only immutable objects and garbage collection? Could smart pointers address this problem?

Yes, garbage collection changes the picture entirely, with or without immutable objects. But the original topic was cross-process shared memory, and I don't know of any cross-process-aware garbage collectors that support shared memory. Although such a thing could easily exist without my knowledge.

Eric

How is this a different problem than the cache coherency problem? https://en.wikipedia.org/wiki/Cache_coherence

Perhaps that's an unhelpful abstraction? This hasn't gone anywhere: https://en.wikipedia.org/wiki/Distributed_shared_memory#Directory_memory_coh...

Here's a great comparison chart for message passing vs distributed shared memory: https://en.wikipedia.org/wiki/Distributed_shared_memory#Message_Passing_vs._...

Could there be a multiprocessing.MemoryPool that tracks allocations, refcounts, and also locks? A combined approach might have an IPC channel/stream/source/sinks for messages that instruct workers to invalidate/re-fetch object_id/references, but consistency and ordering result in the same issues encountered with the cache coherence problem. Then, what is the best way to enqueue changes to shared global state (in shared memory on one node, in this instance)?

(... "Ask HN: Learning about distributed systems?" https://news.ycombinator.com/item?id=23931730 )

A solution for this could help accelerate dask and dask.distributed (which already address many parallel issues in multiprocess and distributed systems in pure Python): "Accelerate intra-node IPC with shared memory" https://github.com/dask/dask/issues/6267

On Sun, Aug 2, 2020, 3:11 PM Eric V. Smith <eric@trueblade.com> wrote:

I forgot that there's also Ray: https://github.com/ray-project/ray. Ray uses Apache Arrow (and Plasma) under the hood; it seems Plasma was originally developed by the Ray team. I don't know how they solve the GC problem. Maybe they disable it.

participants (6)
- Barry Scott
- Eric V. Smith
- Marco Sulla
- Robert Collins
- Vinay Sharma
- Wes Turner