PEP 554 v3 (new interpreters module)
I've updated PEP 554 in response to feedback.  (thanks all!)  There are a few unresolved points (some of them added to the Open Questions section), but the current PEP has changed enough that I wanted to get it out there first.

Notably changed:

* the API relative to object passing has changed somewhat drastically (hopefully simpler and easier to understand), replacing "FIFO" with "channel"
* added an examples section
* added an open questions section
* added a rejected ideas section
* added more items to the deferred functionality section
* the rationale section has moved down below the examples

Please let me know what you think.  I'm especially interested in feedback about the channels.

Thanks!

-eric

++++++++++++++++++++++++++++++++++++++++++++++++

PEP: 554
Title: Multiple Interpreters in the Stdlib
Author: Eric Snow <ericsnowcurrently@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-05
Python-Version: 3.7
Post-History:


Abstract
========

CPython has supported subinterpreters, with increasing levels of support, since version 1.5.  The feature has been available via the C-API.  [c-api]_  Subinterpreters operate in `relative isolation from one another <Interpreter Isolation_>`_, which provides the basis for an `alternative concurrency model <Concurrency_>`_.

This proposal introduces the stdlib ``interpreters`` module.  The module will be `provisional <Provisional Status_>`_.  It exposes the basic functionality of subinterpreters already provided by the C-API.


Proposal
========

The ``interpreters`` module will be added to the stdlib.  It will provide a high-level interface to subinterpreters and wrap the low-level ``_interpreters`` module.  The proposed API is inspired by the ``threading`` module.  See the `Examples`_ section for concrete usage and use cases.

API for interpreters
--------------------

The module provides the following functions:

``list_all()``::

   Return a list of all existing interpreters.

``get_current()``::

   Return the currently running interpreter.

``create()``::

   Initialize a new Python interpreter and return it.  The
   interpreter will be created in the current thread and will remain
   idle until something is run in it.  The interpreter may be used in
   any thread and will run in whichever thread calls ``interp.run()``.

The module also provides the following class:

``Interpreter(id)``::

   id:

      The interpreter's ID (read-only).

   is_running():

      Return whether or not the interpreter is currently executing
      code.  Calling this on the current interpreter will always
      return True.

   destroy():

      Finalize and destroy the interpreter.

      This may not be called on an already running interpreter.
      Doing so results in a RuntimeError.

   run(source_str, /, **shared):

      Run the provided Python source code in the interpreter.  Any
      keyword arguments are added to the interpreter's execution
      namespace.  If any of the values are not supported for sharing
      between interpreters then RuntimeError gets raised.  Currently
      only channels (see "create_channel()" below) are supported.

      This may not be called on an already running interpreter.
      Doing so results in a RuntimeError.

      A "run()" call is quite similar to any other function call.
      Once it completes, the code that called "run()" continues
      executing (in the original interpreter).  Likewise, if there is
      any uncaught exception, it propagates into the code where
      "run()" was called.

      The big difference is that "run()" executes the code in an
      entirely different interpreter, with entirely separate state.

      The state of the current interpreter in the current OS thread
      is swapped out with the state of the target interpreter (the
      one that will execute the code).  When the target finishes
      executing, the original interpreter gets swapped back in and
      its execution resumes.

      So calling "run()" will effectively cause the current Python
      thread to pause.  Sometimes you won't want that pause, in which
      case you should make the "run()" call in another thread.  To do
      so, add a function that calls "run()" and then run that
      function in a normal "threading.Thread".

      Note that the interpreter's state is never reset, neither
      before "run()" executes the code nor after.  Thus the
      interpreter state is preserved between calls to "run()".  This
      includes "sys.modules", the "builtins" module, and the internal
      state of C extension modules.

      Also note that "run()" executes in the namespace of the
      "__main__" module, just like scripts, the REPL, "-m", and "-c".
      Just as the interpreter's state is not ever reset, the
      "__main__" module is never reset.  You can imagine
      concatenating the code from each "run()" call into one long
      script.  This is the same as how the REPL operates.

      Supported code: source text.

API for sharing data
--------------------

The mechanism for passing objects between interpreters is through channels.  A channel is a simplex FIFO similar to a pipe.  The main difference is that channels can be associated with zero or more interpreters on either end.  Unlike queues, which are also many-to-many, channels have no buffer.

``create_channel()``::

   Create a new channel and return (recv, send), the RecvChannel and
   SendChannel corresponding to the ends of the channel.  The channel
   is not closed and destroyed (i.e. garbage-collected) until the
   number of associated interpreters returns to 0.

   An interpreter gets associated with a channel by calling its
   "send()" or "recv()" method.  That association gets dropped by
   calling "close()" on the channel.

   Both ends of the channel are supported "shared" objects (i.e. they
   may be safely shared by different interpreters).  Thus they may be
   passed as keyword arguments to "Interpreter.run()".

``list_all_channels()``::

   Return a list of all open (RecvChannel, SendChannel) pairs.

``RecvChannel(id)``::

   The receiving end of a channel.  An interpreter may use this to
   receive objects from another interpreter.  At first only bytes
   will be supported.

   id:

      The channel's unique ID.

   interpreters:

      The list of associated interpreters (those that have called
      the "recv()" method).

   __next__():

      Return the next object from the channel.  If none have been
      sent then wait until the next send.

   recv():

      Return the next object from the channel.  If none have been
      sent then wait until the next send.  If the channel has been
      closed then EOFError is raised.

   recv_nowait(default=None):

      Return the next object from the channel.  If none have been
      sent then return the default.  If the channel has been closed
      then EOFError is raised.

   close():

      No longer associate the current interpreter with the channel
      (on the receiving end).  This is a noop if the interpreter
      isn't already associated.  Once an interpreter is no longer
      associated with the channel, subsequent (or current) send()
      and recv() calls from that interpreter will raise EOFError.

      Once the number of associated interpreters on both ends drops
      to 0, the channel is actually marked as closed.  The Python
      runtime will garbage collect all closed channels.  Note that
      "close()" is automatically called when it is no longer used in
      the current interpreter.

      This operation is idempotent.

      Return True if the current interpreter was still associated
      with the receiving end of the channel and False otherwise.

``SendChannel(id)``::

   The sending end of a channel.  An interpreter may use this to
   send objects to another interpreter.  At first only bytes will be
   supported.

   id:

      The channel's unique ID.

   interpreters:

      The list of associated interpreters (those that have called
      the "send()" method).

   send(obj):

      Send the object to the receiving end of the channel.  Wait
      until the object is received.  If the channel does not support
      the object then TypeError is raised.  Currently only bytes are
      supported.  If the channel has been closed then EOFError is
      raised.

   send_nowait(obj):

      Send the object to the receiving end of the channel.  If the
      object is received then return True.  Otherwise return False.
      If the channel does not support the object then TypeError is
      raised.  If the channel has been closed then EOFError is
      raised.

   close():

      No longer associate the current interpreter with the channel
      (on the sending end).  This is a noop if the interpreter isn't
      already associated.  Once an interpreter is no longer
      associated with the channel, subsequent (or current) send()
      and recv() calls from that interpreter will raise EOFError.

      Once the number of associated interpreters on both ends drops
      to 0, the channel is actually marked as closed.  The Python
      runtime will garbage collect all closed channels.  Note that
      "close()" is automatically called when it is no longer used in
      the current interpreter.

      This operation is idempotent.  Return True if the current
      interpreter was still associated with the sending end of the
      channel and False otherwise.


Examples
========

Run isolated code
-----------------

::

   interp = interpreters.create()
   print('before')
   interp.run('print("during")')
   print('after')

Run in a thread
---------------

::

   interp = interpreters.create()
   def run():
       interp.run('print("during")')
   t = threading.Thread(target=run)
   print('before')
   t.start()
   print('after')

Pre-populate an interpreter
---------------------------

::

   interp = interpreters.create()
   interp.run("""if True:
       import some_lib
       import an_expensive_module
       some_lib.set_up()
       """)
   wait_for_request()
   interp.run("""if True:
       some_lib.handle_request()
       """)

Handling an exception
---------------------

::

   interp = interpreters.create()
   try:
       interp.run("""if True:
           raise KeyError
           """)
   except KeyError:
       print("got the error from the subinterpreter")

Synchronize using a channel
---------------------------

::

   interp = interpreters.create()
   r, s = interpreters.create_channel()
   def run():
       interp.run("""if True:
           reader.recv()
           print("during")
           reader.close()
           """,
           reader=r)
   t = threading.Thread(target=run)
   print('before')
   t.start()
   print('after')
   s.send(b'')
   s.close()

Sharing a file descriptor
-------------------------

::

   interp = interpreters.create()
   r1, s1 = interpreters.create_channel()
   r2, s2 = interpreters.create_channel()
   def run():
       interp.run("""if True:
           fd = int.from_bytes(
                   reader.recv(), 'big')
           for line in os.fdopen(fd):
               print(line)
           writer.send(b'')
           """,
           reader=r1, writer=s2)
   t = threading.Thread(target=run)
   t.start()
   with open('spamspamspam') as infile:
       fd = infile.fileno().to_bytes(1, 'big')
       s1.send(fd)
       r2.recv()

Passing objects via pickle
--------------------------

::

   interp = interpreters.create()
   r, s = interpreters.create_channel()
   interp.run("""if True:
       import pickle
       """,
       reader=r)
   def run():
       interp.run("""if True:
           data = reader.recv()
           while data:
               obj = pickle.loads(data)
               do_something(obj)
               data = reader.recv()
           reader.close()
           """,
           reader=r)
   t = threading.Thread(target=run)
   t.start()
   for obj in input:
       data = pickle.dumps(obj)
       s.send(data)
   s.send(b'')


Rationale
=========

Running code in multiple interpreters provides a useful level of isolation within the same process.  This can be leveraged in a number of ways.  Furthermore, subinterpreters provide a well-defined framework in which such isolation may be extended.

CPython has supported subinterpreters, with increasing levels of support, since version 1.5.  While the feature has the potential to be a powerful tool, subinterpreters have suffered from neglect because they are not available directly from Python.  Exposing the existing functionality in the stdlib will help reverse the situation.

This proposal is focused on enabling the fundamental capability of multiple isolated interpreters in the same Python process.  This is a new area for Python so there is relative uncertainty about the best tools to provide as companions to subinterpreters.  Thus we minimize the functionality we add in the proposal as much as possible.

Concerns
--------

* "subinterpreters are not worth the trouble"

Some have argued that subinterpreters do not add sufficient benefit to justify making them an official part of Python.  Adding features to the language (or stdlib) has a cost in increasing the size of the language.  So it must pay for itself.

In this case, subinterpreters provide a novel concurrency model focused on isolated threads of execution.  Furthermore, they present an opportunity for changes in CPython that will allow simultaneous use of multiple CPU cores (currently prevented by the GIL).

Alternatives to subinterpreters include threading, async, and multiprocessing.  Threading is limited by the GIL and async isn't the right solution for every problem (nor for every person).  Multiprocessing is likewise valuable in some but not all situations.  Direct IPC (rather than via the multiprocessing module) provides similar benefits but with the same caveat.

Notably, subinterpreters are not intended as a replacement for any of the above.  Certainly they overlap in some areas, but the benefits of subinterpreters include isolation and (potentially) performance.  In particular, subinterpreters provide a direct route to an alternate concurrency model (e.g. CSP) which has found success elsewhere and will appeal to some Python users.  That is the core value that the ``interpreters`` module will provide.

* "stdlib support for subinterpreters adds extra burden on C extension authors"

In the `Interpreter Isolation`_ section below we identify ways in which isolation in CPython's subinterpreters is incomplete.  Most notable is extension modules that use C globals to store internal state.  PEP 3121 and PEP 489 provide a solution for most of the problem, but one still remains.  [petr-c-ext]_  Until that is resolved, C extension authors will face extra difficulty to support subinterpreters.

Consequently, projects that publish extension modules may face an increased maintenance burden as their users start using subinterpreters, where their modules may break.  This situation is limited to modules that use C globals (or use libraries that use C globals) to store internal state.

Ultimately this comes down to a question of how often it will be a problem in practice: how many projects would be affected, how often their users will be affected, what the additional maintenance burden will be for projects, and what the overall benefit of subinterpreters is to offset those costs.

The position of this PEP is that the actual extra maintenance burden will be small and well below the threshold at which subinterpreters are worth it.


About Subinterpreters
=====================

Shared data
-----------

Subinterpreters are inherently isolated (with caveats explained below), in contrast to threads.  This enables `a different concurrency model <Concurrency_>`_ than is currently readily available in Python.  `Communicating Sequential Processes`_ (CSP) is the prime example.

A key component of this approach to concurrency is message passing.  So providing a message/object passing mechanism alongside ``Interpreter`` is a fundamental requirement.  This proposal includes a basic mechanism upon which more complex machinery may be built.  That basic mechanism draws inspiration from pipes, queues, and CSP's channels.  [fifo]_

The key challenge here is that sharing objects between interpreters faces complexity due in part to CPython's current memory model.  Furthermore, in this class of concurrency, the ideal is that objects only exist in one interpreter at a time.  However, this is not practical for Python so we initially constrain supported objects to ``bytes``.  There are a number of strategies we may pursue in the future to expand supported objects and object sharing strategies.

Note that the complexity of object sharing increases as subinterpreters become more isolated, e.g. after GIL removal.  So the mechanism for message passing needs to be carefully considered.  Keeping the API minimal and initially restricting the supported types helps us avoid further exposing any underlying complexity to Python users.

To make this work, the mutable shared state will be managed by the Python runtime, not by any of the interpreters.  Initially we will support only one type of object for shared state: the channels provided by ``create_channel()``.  Channels, in turn, will carefully manage passing objects between interpreters.

Interpreter Isolation
---------------------

CPython's interpreters are intended to be strictly isolated from each other.  Each interpreter has its own copy of all modules, classes, functions, and variables.  The same applies to state in C, including in extension modules.  The CPython C-API docs explain more.  [caveats]_

However, there are ways in which interpreters share some state.  First of all, some process-global state remains shared:

* file descriptors
* builtin types (e.g. dict, bytes)
* singletons (e.g. None)
* underlying static module data (e.g. functions) for builtin/extension/frozen modules

There are no plans to change this.

Second, some isolation is faulty due to bugs or implementations that did not take subinterpreters into account.  This includes things like extension modules that rely on C globals.  [cryptography]_  In these cases bugs should be opened (some are already):

* readline module hook functions (http://bugs.python.org/issue4202)
* memory leaks on re-init (http://bugs.python.org/issue21387)

Finally, some potential isolation is missing due to the current design of CPython.  Improvements are currently going on to address gaps in this area:

* interpreters share the GIL
* interpreters share memory management (e.g. allocators, gc)
* GC is not run per-interpreter [global-gc]_
* at-exit handlers are not run per-interpreter [global-atexit]_
* extensions using the ``PyGILState_*`` API are incompatible [gilstate]_

Concurrency
-----------

Concurrency is a challenging area of software development.
Decades of research and practice have led to a wide variety of concurrency models, each with different goals.  Most center on correctness and usability.

One class of concurrency models focuses on isolated threads of execution that interoperate through some message passing scheme.  A notable example is `Communicating Sequential Processes`_ (CSP), upon which Go's concurrency is based.  The isolation inherent to subinterpreters makes them well-suited to this approach.

Existing Usage
--------------

Subinterpreters are not a widely used feature.  In fact, the only documented case of wide-spread usage is `mod_wsgi <https://github.com/GrahamDumpleton/mod_wsgi>`_.  On the one hand, this case provides confidence that existing subinterpreter support is relatively stable.  On the other hand, there isn't much of a sample size from which to judge the utility of the feature.


Provisional Status
==================

The new ``interpreters`` module will be added with "provisional" status (see PEP 411).  This allows Python users to experiment with the feature and provide feedback while still allowing us to adjust to that feedback.  The module will be provisional in Python 3.7 and we will make a decision before the 3.8 release whether to keep it provisional, graduate it, or remove it.


Alternate Python Implementations
================================

TBD


Open Questions
==============

Leaking exceptions across interpreters
--------------------------------------

As currently proposed, uncaught exceptions from ``run()`` propagate to the frame that called it.  However, this means that exception objects are leaking across the inter-interpreter boundary.  Likewise, the frames in the traceback potentially leak.

While that might not be a problem currently, it would be a problem once interpreters get better isolation relative to memory management (which is necessary to stop sharing the GIL between interpreters).  So the semantics of how the exceptions propagate needs to be resolved.

Initial support for buffers in channels
---------------------------------------

An alternative to supporting bytes in channels is supporting read-only buffers (the PEP 3118 kind).  Then ``recv()`` would return a memoryview to expose the buffer in a zero-copy way.  This is similar to what ``multiprocessing.Connection`` supports.  [mp-conn]_

Switching to such an approach would help resolve questions of how passing bytes through channels will work once we isolate memory management in interpreters.


Deferred Functionality
======================

In the interest of keeping this proposal minimal, the following functionality has been left out for future consideration.  Note that this is not a judgement against any of said capability, but rather a deferment.  That said, each is arguably valid.

Interpreter.call()
------------------

It would be convenient to run existing functions in subinterpreters directly.  ``Interpreter.run()`` could be adjusted to support this or a ``call()`` method could be added::

   Interpreter.call(f, *args, **kwargs)

This suffers from the same problem as sharing objects between interpreters via queues.  The minimal solution (running a source string) is sufficient for us to get the feature out where it can be explored.

timeout arg to recv() and send()
--------------------------------

Typically functions that have a ``block`` argument also have a ``timeout`` argument.  We can add it later if needed.

get_main()
----------

CPython has a concept of a "main" interpreter.  This is the initial interpreter created during CPython's runtime initialization.
It may be useful to identify the main interpreter.  For instance, the main interpreter should not be destroyed.  However, for the basic functionality of a high-level API a ``get_main()`` function is not necessary.  Furthermore, there is no requirement that a Python implementation have a concept of a main interpreter.  So until there's a clear need we'll leave ``get_main()`` out.

Interpreter.run_in_thread()
---------------------------

This method would make a ``run()`` call for you in a thread.  Doing this using only ``threading.Thread`` and ``run()`` is relatively trivial so we've left it out.

Synchronization Primitives
--------------------------

The ``threading`` module provides a number of synchronization primitives for coordinating concurrent operations.  This is especially necessary due to the shared-state nature of threading.  In contrast, subinterpreters do not share state.  Data sharing is restricted to channels, which do away with the need for explicit synchronization.  If any sort of opt-in shared state support is added to subinterpreters in the future, that same effort can introduce synchronization primitives to meet that need.

CSP Library
-----------

A ``csp`` module would not be a large step away from the functionality provided by this PEP.  However, adding such a module is outside the minimalist goals of this proposal.

Syntactic Support
-----------------

The ``Go`` language provides a concurrency model based on CSP, so it's similar to the concurrency model that subinterpreters support.  ``Go`` provides syntactic support, as well as several builtin concurrency primitives, to make concurrency a first-class feature.  Conceivably, similar syntactic (and builtin) support could be added to Python using subinterpreters.  However, that is *way* outside the scope of this PEP!

Multiprocessing
---------------

The ``multiprocessing`` module could support subinterpreters in the same way it supports threads and processes.  In fact, the module's maintainer, Davin Potts, has indicated this is a reasonable feature request.  However, it is outside the narrow scope of this PEP.

C-extension opt-in/opt-out
--------------------------

By using the ``PyModuleDef_Slot`` introduced by PEP 489, we could easily add a mechanism by which C-extension modules could opt out of support for subinterpreters.  Then the import machinery, when operating in a subinterpreter, would need to check the module for support.  It would raise an ImportError if unsupported.

Alternately we could support opting in to subinterpreter support.  However, that would probably exclude many more modules (unnecessarily) than the opt-out approach.  The scope of adding the ModuleDef slot and fixing up the import machinery is non-trivial, but could be worth it.  It all depends on how many extension modules break under subinterpreters.  Given the relatively few cases we know of through mod_wsgi, we can leave this for later.

Poisoning channels
------------------

CSP has the concept of poisoning a channel.  Once a channel has been poisoned, any ``send()`` or ``recv()`` call on it will raise a special exception, effectively ending execution in the interpreter that tried to use the poisoned channel.

This could be accomplished by adding a ``poison()`` method to both ends of the channel.  The ``close()`` method could work if it had a ``force`` option to force the channel closed.  Regardless, these semantics are relatively specialized and can wait.
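
For illustration, usage might look something like this (purely hypothetical: neither ``poison()`` nor the specific exception is part of this proposal)::

   r, s = interpreters.create_channel()
   s.poison()  # hypothetical: mark the channel as poisoned
   r.recv()    # would now raise the special exception described above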

Sending channels over channels
------------------------------

Some advanced usage of subinterpreters could take advantage of the ability to send channels over channels, in addition to bytes.  Given that channels will already be multi-interpreter safe, supporting them in ``RecvChannel.recv()`` wouldn't be a big change.  However, this can wait until the basic functionality has been ironed out.

Resetting __main__
------------------

As proposed, every call to ``Interpreter.run()`` will execute in the namespace of the interpreter's existing ``__main__`` module.  This means that data persists there between ``run()`` calls.  Sometimes this isn't desirable and you want to execute in a fresh ``__main__``.  Also, you don't necessarily want to leak objects there that you aren't using any more.

Solutions include:

* a ``create()`` arg to indicate resetting ``__main__`` after each ``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired

This isn't a critical feature initially.  It can wait until later if desirable.

Support passing ints in channels
--------------------------------

Passing ints around should be fine and ultimately is probably desirable.  However, we can get by with serializing them as bytes for now.  The goal is a minimal API for the sake of basic functionality at first.

File descriptors and sockets in channels
----------------------------------------

Given that file descriptors and sockets are process-global resources, support for passing them through channels is a reasonable idea.  They would be a good candidate for the first effort at expanding the types that channels support.  They aren't strictly necessary for the initial API.


Rejected Ideas
==============

Explicit channel association
----------------------------

Interpreters are implicitly associated with channels upon ``recv()`` and ``send()`` calls.  They are de-associated with ``close()`` calls.  The alternative would be explicit methods.  It would be either ``add_channel()`` and ``remove_channel()`` methods on ``Interpreter`` objects or something similar on channel objects.

In practice, this level of management shouldn't be necessary for users.  So adding more explicit support would only add clutter to the API.

Use pipes instead of channels
-----------------------------

A pipe would be a simplex FIFO between exactly two interpreters.  For most use cases this would be sufficient.  It could potentially simplify the implementation as well.  However, it isn't a big step to supporting a many-to-many simplex FIFO via channels.  Also, with pipes the API ends up being slightly more complicated, requiring naming the pipes.

Use queues instead of channels
------------------------------

The main difference between queues and channels is that queues support buffering.  This would complicate the blocking semantics of ``recv()`` and ``send()``.  Also, queues can be built on top of channels.

"enumerate"
-----------

The ``list_all()`` function provides the list of all interpreters.  In the threading module, which partly inspired the proposed API, the function is called ``enumerate()``.  The name is different here to avoid confusing Python users that are not already familiar with the threading API.  For them "enumerate" is rather unclear, whereas "list_all" is clear.


References
==========

.. [c-api]
   https://docs.python.org/3/c-api/init.html#sub-interpreter-support

.. _Communicating Sequential Processes:
.. [CSP]
   https://en.wikipedia.org/wiki/Communicating_sequential_processes
   https://github.com/futurecore/python-csp

.. [fifo]
   https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pipe
   https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
   https://docs.python.org/3/library/queue.html#module-queue
   http://stackless.readthedocs.io/en/2.7-slp/library/stackless/channels.html
   https://golang.org/doc/effective_go.html#sharing
   http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-fee...

.. [caveats]
   https://docs.python.org/3/c-api/init.html#bugs-and-caveats

.. [petr-c-ext]
   https://mail.python.org/pipermail/import-sig/2016-June/001062.html
   https://mail.python.org/pipermail/python-ideas/2016-April/039748.html

.. [cryptography]
   https://github.com/pyca/cryptography/issues/2299

.. [global-gc]
   http://bugs.python.org/issue24554

.. [gilstate]
   https://bugs.python.org/issue10915
   http://bugs.python.org/issue15751

.. [global-atexit]
   https://bugs.python.org/issue6531

.. [mp-conn]
   https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Conne...


Copyright
=========

This document has been placed in the public domain.
On 14 September 2017 at 11:44, Eric Snow <ericsnowcurrently@gmail.com> wrote:
I've updated PEP 554 in response to feedback. (thanks all!) There are a few unresolved points (some of them added to the Open Questions section), but the current PEP has changed enough that I wanted to get it out there first.
Notably changed:
* the API relative to object passing has changed somewhat drastically (hopefully simpler and easier to understand), replacing "FIFO" with "channel"
* added an examples section
* added an open questions section
* added a rejected ideas section
* added more items to the deferred functionality section
* the rationale section has moved down below the examples
Please let me know what you think. I'm especially interested in feedback about the channels. Thanks!
I like the new pipe-like channels API more than the previous named FIFO approach :)
send(obj):
Send the object to the receiving end of the channel. Wait until the object is received. If the channel does not support the object then TypeError is raised. Currently only bytes are supported. If the channel has been closed then EOFError is raised.
I still expect any form of object sharing to hinder your per-interpreter GIL efforts, so restricting the initial implementation to memoryview-only seems more future-proof to me.
Pre-populate an interpreter
---------------------------
::
interp = interpreters.create()
interp.run("""if True:
    import some_lib
    import an_expensive_module
    some_lib.set_up()
    """)
wait_for_request()
interp.run("""if True:
    some_lib.handle_request()
    """)
I find the "if True:"'s sprinkled through the examples distracting, so I'd prefer either:

1. Using textwrap.dedent; or
2. Assigning the code to a module level attribute

::

    interp = interpreters.create()

    setup_code = """\
    import some_lib
    import an_expensive_module
    some_lib.set_up()
    """
    interp.run(setup_code)

    wait_for_request()

    handler_code = """\
    some_lib.handle_request()
    """
    interp.run(handler_code)
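
For comparison, option 1 with textwrap.dedent might look like this (a sketch; ``some_lib`` and ``an_expensive_module`` are the same placeholders as in the example above):

    import textwrap

    interp = interpreters.create()
    setup_code = textwrap.dedent("""
        import some_lib
        import an_expensive_module
        some_lib.set_up()
        """)
    interp.run(setup_code)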
Handling an exception
---------------------
::
interp = interpreters.create()
try:
    interp.run("""if True:
        raise KeyError
        """)
except KeyError:
    print("got the error from the subinterpreter")
As with the message passing through channels, I think you'll really want to minimise any kind of implicit object sharing that may interfere with future efforts to make the GIL truly an *interpreter* lock, rather than the global process lock that it is currently.

One possible way to approach that would be to make the low level run() API a more Go-style API rather than a Python-style one, and have it return a (result, err) 2-tuple.  "err.raise()" would then translate the foreign interpreter's exception into a local interpreter exception, but the *traceback* for that exception would be entirely within the current interpreter.
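
A sketch of how such an API might read (hypothetical names; since ``raise`` is a reserved word, a real method would need another spelling, e.g. ``reraise()``):

    result, err = interp.run("""if True:
        raise KeyError
        """)
    if err is not None:
        err.reraise()  # hypothetical: re-raise locally, with a traceback
                       # confined to the current interpreter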
About Subinterpreters
=====================
Shared data
-----------
Subinterpreters are inherently isolated (with caveats explained below), in contrast to threads. This enables `a different concurrency model <Concurrency_>`_ than is currently readily available in Python. `Communicating Sequential Processes`_ (CSP) is the prime example.
A key component of this approach to concurrency is message passing. So providing a message/object passing mechanism alongside ``Interpreter`` is a fundamental requirement. This proposal includes a basic mechanism upon which more complex machinery may be built. That basic mechanism draws inspiration from pipes, queues, and CSP's channels. [fifo]_
The key challenge here is that sharing objects between interpreters faces complexity due in part to CPython's current memory model. Furthermore, in this class of concurrency, the ideal is that objects only exist in one interpreter at a time. However, this is not practical for Python so we initially constrain supported objects to ``bytes``. There are a number of strategies we may pursue in the future to expand supported objects and object sharing strategies.
Note that the complexity of object sharing increases as subinterpreters become more isolated, e.g. after GIL removal. So the mechanism for message passing needs to be carefully considered. Keeping the API minimal and initially restricting the supported types helps us avoid further exposing any underlying complexity to Python users.
To make this work, the mutable shared state will be managed by the Python runtime, not by any of the interpreters.  Initially we will support only one type of object for shared state: the channels provided by ``create_channel()``.  Channels, in turn, will carefully manage passing objects between interpreters.
Interpreters themselves will also need to be shared objects, as:

- they all have access to "interpreters.list_all()"
- when we do "interpreters.create()", the calling interpreter gets a reference to itself via "interpreters.get_current()"

(These shared objects are what I suspect you may end up needing a process global read/write lock to manage, by the way - I think it would be great if you can figure out a way to avoid that, it's just not entirely clear to me what that might look like.  I do think you're on the right track by prohibiting the destruction of an interpreter that's currently running, and the destruction of channels that are currently still associated with an interpreter)
Interpreter Isolation
---------------------
This section is a really nice addition :)
Existing Usage
--------------
Subinterpreters are not a widely used feature. In fact, the only documented case of wide-spread usage is `mod_wsgi <https://github.com/GrahamDumpleton/mod_wsgi>`_. On the one hand, this case provides confidence that existing subinterpreter support is relatively stable. On the other hand, there isn't much of a sample size from which to judge the utility of the feature.
Nathaniel pointed out that JEP embeds CPython subinterpreters inside the JVM similar to the way that mod_wsgi embeds them inside Apache httpd: https://github.com/ninia/jep/wiki/How-Jep-Works
Open Questions
==============
Leaking exceptions across interpreters
--------------------------------------
As currently proposed, uncaught exceptions from ``run()`` propagate to the frame that called it. However, this means that exception objects are leaking across the inter-interpreter boundary. Likewise, the frames in the traceback potentially leak.
While that might not be a problem currently, it would be a problem once interpreters get better isolation relative to memory management (which is necessary to stop sharing the GIL between interpreters). So the semantics of how the exceptions propagate needs to be resolved.
As noted above, I think you *really* want to avoid leaking exceptions in the initial implementation. A non-exception-based error signaling mechanism would be one way to do that, similar to how the low-level subprocess APIs actually report the return code, which higher level APIs then turn into an exception. resp.raise_for_status() does something similar for HTTP responses in the requests API.
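
The subprocess precedent in concrete, runnable terms:

    import subprocess, sys

    cp = subprocess.run([sys.executable, '-c', 'raise SystemExit(2)'])
    print(cp.returncode)   # the low-level result is merely reported (2)
    cp.check_returncode()  # the high-level call turns it into CalledProcessError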
Initial support for buffers in channels
---------------------------------------
An alternative to supporting bytes in channels is supporting read-only buffers (the PEP 3118 kind).  Then ``recv()`` would return a memoryview to expose the buffer in a zero-copy way.  This is similar to what ``multiprocessing.Connection`` supports.  [mp-conn]_
Switching to such an approach would help resolve questions of how passing bytes through channels will work once we isolate memory management in interpreters.
Exactly :)
Resetting __main__
------------------
As proposed, every call to ``Interpreter.run()`` will execute in the namespace of the interpreter's existing ``__main__`` module.  This means that data persists there between ``run()`` calls.  Sometimes this isn't desirable and you want to execute in a fresh ``__main__``.  Also, you don't necessarily want to leak objects there that you aren't using any more.
Solutions include:
* a ``create()`` arg to indicate resetting ``__main__`` after each ``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired
This isn't a critical feature initially. It can wait until later if desirable.
I was going to note that you can already do this:

    interp.run("globals().clear()")

However, that turns out to clear *too* much, since it also clobbers all the __dunder__ attributes that the interpreter needs in a code execution environment.

Either way, if you added this, I think it would make more sense as an "importlib.util.reset_globals()" operation, rather than have it be something specific to subinterpreters.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
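
One possible shape for that helper, sketched with a hypothetical name and assuming that preserving the ``__dunder__`` names is sufficient:

    def reset_globals(ns=None):
        # clear a module namespace, keeping the __dunder__ names
        # (__name__, __builtins__, ...) the interpreter relies on
        import __main__
        ns = vars(__main__) if ns is None else ns
        for name in list(ns):
            if not (name.startswith('__') and name.endswith('__')):
                del ns[name]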
On Wed, Sep 13, 2017 at 11:56 PM, Nick Coghlan <ncoghlan@gmail.com> wrote: [..]
send(obj):
Send the object to the receiving end of the channel. Wait until the object is received. If the channel does not support the object then TypeError is raised. Currently only bytes are supported. If the channel has been closed then EOFError is raised.
I still expect any form of object sharing to hinder your per-interpreter GIL efforts, so restricting the initial implementation to memoryview-only seems more future-proof to me.
+1.  Working with memoryviews is as convenient as with bytes.

Yury
On Sep 13, 2017 9:01 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:

On 14 September 2017 at 11:44, Eric Snow <ericsnowcurrently@gmail.com> wrote:
send(obj):
Send the object to the receiving end of the channel. Wait until the object is received. If the channel does not support the object then TypeError is raised. Currently only bytes are supported. If the channel has been closed then EOFError is raised.
I still expect any form of object sharing to hinder your per-interpreter GIL efforts, so restricting the initial implementation to memoryview-only seems more future-proof to me.

I don't get it.  With bytes, you can either share objects or copy them and the user can't tell the difference, so you can change your mind later if you want.  But memoryviews require some kind of cross-interpreter strong reference to keep the underlying buffer object alive.  So if you want to minimize object sharing, surely bytes are more future-proof.
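
The copy-versus-share point is easy to demonstrate with plain bytes today; the identity result below is a CPython implementation detail, which is exactly why code can't rely on it:

    a = b'spam'
    b = bytes(a)   # copy, or the same object?  The caller can't tell:
    assert a == b  # equality is all that's observable for immutable bytes
    print(a is b)  # happens to be True in current CPython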
Handling an exception
---------------------
::
interp = interpreters.create()
try:
    interp.run("""if True:
        raise KeyError
        """)
except KeyError:
    print("got the error from the subinterpreter")
As with the message passing through channels, I think you'll really want to minimise any kind of implicit object sharing that may interfere with future efforts to make the GIL truly an *interpreter* lock, rather than the global process lock that it is currently.

One possible way to approach that would be to make the low level run() API a more Go-style API rather than a Python-style one, and have it return a (result, err) 2-tuple.  "err.raise()" would then translate the foreign interpreter's exception into a local interpreter exception, but the *traceback* for that exception would be entirely within the current interpreter.

It would also be reasonable to simply not return any value/exception from run() at all, or maybe just a bool for whether there was an unhandled exception.  Any high level API is going to be injecting code on both sides of the interpreter boundary anyway, so it can do whatever exception and traceback translation it wants to.
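
For instance, a high level wrapper might inject code along these lines (a sketch against the PEP's proposed API; it assumes the exception pickles cleanly, and uses a thread so the channel send can complete):

    import pickle, textwrap, threading

    def run_checked(interp, code, recv_ch, send_ch):
        # e.g. r, s = interpreters.create_channel(); run_checked(interp, code, r, s)
        wrapper = textwrap.dedent("""
            import pickle
            try:
                exec(%r)
            except Exception as exc:
                err.send(pickle.dumps(exc))
            else:
                err.send(b'')
            """) % (code,)
        t = threading.Thread(target=interp.run, args=(wrapper,),
                             kwargs={'err': send_ch})
        t.start()
        data = recv_ch.recv()  # blocks until the subinterpreter reports back
        t.join()
        if data:
            raise pickle.loads(data)  # translated into a local exception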
Resetting __main__
------------------
As proposed, every call to ``Interpreter.run()`` will execute in the namespace of the interpreter's existing ``__main__`` module.  This means that data persists there between ``run()`` calls.  Sometimes this isn't desirable and you want to execute in a fresh ``__main__``.  Also, you don't necessarily want to leak objects there that you aren't using any more.
Solutions include:
* a ``create()`` arg to indicate resetting ``__main__`` after each ``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired
This isn't a critical feature initially. It can wait until later if desirable.
I was going to note that you can already do this:

    interp.run("globals().clear()")

However, that turns out to clear *too* much, since it also clobbers all the __dunder__ attributes that the interpreter needs in a code execution environment.

Either way, if you added this, I think it would make more sense as an "importlib.util.reset_globals()" operation, rather than have it be something specific to subinterpreters.

This is another point where the API could reasonably say that if you want clean namespaces then you should do that yourself (e.g. by setting up your own globals dict and using it to execute any post-bootstrap code).

-n
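
A sketch of that do-it-yourself approach, using the PEP's proposed API:

    interp = interpreters.create()
    interp.run("ns = {}")                      # one scratch namespace in __main__
    interp.run("exec('x = 1; print(x)', ns)")  # post-bootstrap code runs in ns
    interp.run("ns = {}")                      # 'reset' by rebinding a fresh dict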
On 14 September 2017 at 15:27, Nathaniel Smith <njs@pobox.com> wrote:
On Sep 13, 2017 9:01 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
On 14 September 2017 at 11:44, Eric Snow <ericsnowcurrently@gmail.com> wrote:
send(obj):
Send the object to the receiving end of the channel. Wait until the object is received. If the channel does not support the object then TypeError is raised. Currently only bytes are supported. If the channel has been closed then EOFError is raised.
I still expect any form of object sharing to hinder your per-interpreter GIL efforts, so restricting the initial implementation to memoryview-only seems more future-proof to me.
I don't get it. With bytes, you can either share objects or copy them and the user can't tell the difference, so you can change your mind later if you want. But memoryviews require some kind of cross-interpreter strong reference to keep the underlying buffer object alive. So if you want to minimize object sharing, surely bytes are more future-proof.
Not really, because the only way to ensure object separation (i.e. no refcounted objects accessible from multiple interpreters at once) with a bytes-based API would be to either:

1. Always copy (eliminating most of the low overhead communications benefits that subinterpreters may offer over multiple processes)
2. Make the bytes implementation more complicated by allowing multiple bytes objects to share the same underlying storage while presenting as distinct objects in different interpreters
3. Make the output on the receiving side not actually a bytes object, but instead a view onto memory owned by another object in a different interpreter (a "memory view", one might say)

And yes, using memory views for this does mean defining either a subclass or a mediating object that not only keeps the originating object alive until the receiving memoryview is closed, but also retains a reference to the originating interpreter so that it can switch to it when it needs to manipulate the source object's refcount or call one of the buffer methods.

Yury and I are fine with that, since it means that either the sender *or* the receiver can decide to copy the data (e.g. by calling bytes(obj) before sending, or bytes(view) after receiving), and in the meantime, the object holding the cross-interpreter view knows that it needs to switch interpreters (and hence acquire the sending interpreter's GIL) before doing anything with the source object.

The reason we're OK with this is that it means that only reading a new message from a channel (i.e. creating a cross-interpreter view) or discarding a previously read message (i.e. closing a cross-interpreter view) will be synchronisation points where the receiving interpreter necessarily needs to acquire the sending interpreter's GIL.

By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
Handling an exception
---------------------

It would also be reasonable to simply not return any value/exception from run() at all, or maybe just a bool for whether there was an unhandled exception.  Any high level API is going to be injecting code on both sides of the interpreter boundary anyway, so it can do whatever exception and traceback translation it wants to.
So any more detailed response would *have* to come back as a channel message?

That sounds like a reasonable option to me, too, especially since module level code doesn't have a return value as such - you can really only say "it raised an exception (and this was the exception it raised)" or "it reached the end of the code without raising an exception".

Given that, I think subprocess.run() (with check=False) is the right API precedent here:
https://docs.python.org/3/library/subprocess.html#subprocess.run

That always returns subprocess.CompletedProcess, and then you can call "cp.check_returncode()" to get it to raise subprocess.CalledProcessError for non-zero return codes.

For interpreter.run(), we could keep the initial RunResult *really* simple and only report back:

* source: the source code passed to run()
* shared: the keyword args passed to run() (name chosen to match functools.partial)
* completed: completed execution without raising an exception? (True if yes, False otherwise)

Whether or not to report more details for a raised exception, and provide some mechanism to reraise it in the calling interpreter could then be deferred until later.

The subprocess.run() comparison does make me wonder whether this might be a more future-proof signature for Interpreter.run() though:

    def run(source_str, /, *, channels=None):
        ...

That way channels can be a namespace *specifically* for passing in channels, and can be reported as such on RunResult.  If we decide to allow arbitrary shared objects in the future, or add flag options like "reraise=True" to reraise exceptions from the subinterpreter in the current interpreter, we'd have that ability, rather than having the entire potential keyword namespace taken up for passing shared objects.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
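
Spelled out, that minimal result object might look like this (a sketch; the class and its check_completed() method are hypothetical, mirroring subprocess.CompletedProcess):

    class RunResult:
        def __init__(self, source, shared, completed):
            self.source = source        # the source code passed to run()
            self.shared = shared        # the channels passed to run()
            self.completed = completed  # False if an exception went uncaught

        def check_completed(self):
            # the analogue of CompletedProcess.check_returncode()
            if not self.completed:
                raise RuntimeError('uncaught exception in subinterpreter')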
On Thu, Sep 14, 2017 at 5:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 14 September 2017 at 15:27, Nathaniel Smith <njs@pobox.com> wrote:
I don't get it. With bytes, you can either share objects or copy them and the user can't tell the difference, so you can change your mind later if you want. But memoryviews require some kind of cross-interpreter strong reference to keep the underlying buffer object alive. So if you want to minimize object sharing, surely bytes are more future-proof.
Not really, because the only way to ensure object separation (i.e. no refcounted objects accessible from multiple interpreters at once) with a bytes-based API would be to either:
1. Always copy (eliminating most of the low overhead communications benefits that subinterpreters may offer over multiple processes)
2. Make the bytes implementation more complicated by allowing multiple bytes objects to share the same underlying storage while presenting as distinct objects in different interpreters
3. Make the output on the receiving side not actually a bytes object, but instead a view onto memory owned by another object in a different interpreter (a "memory view", one might say)
And yes, using memory views for this does mean defining either a subclass or a mediating object that not only keeps the originating object alive until the receiving memoryview is closed, but also retains a reference to the originating interpreter so that it can switch to it when it needs to manipulate the source object's refcount or call one of the buffer methods.
Yury and I are fine with that, since it means that either the sender *or* the receiver can decide to copy the data (e.g. by calling bytes(obj) before sending, or bytes(view) after receiving), and in the meantime, the object holding the cross-interpreter view knows that it needs to switch interpreters (and hence acquire the sending interpreter's GIL) before doing anything with the source object.
The reason we're OK with this is that it means that only reading a new message from a channel (i.e. creating a cross-interpreter view) or discarding a previously read message (i.e. closing a cross-interpreter view) will be synchronisation points where the receiving interpreter necessarily needs to acquire the sending interpreter's GIL.
By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
Ah, that makes more sense.

I am nervous that allowing arbitrary memoryviews gives a *little* more power than we need or want.  I like that the current API can reasonably be emulated using subprocesses -- it opens up the door for backports, compatibility support on language implementations that don't support subinterpreters, direct benchmark comparisons between the two implementation strategies, etc.  But if we allow arbitrary memoryviews, then this requires that you can take (a) an arbitrary object, not specified ahead of time, and (b) provide two read-write views on it in separate interpreters such that modifications made in one are immediately visible in the other.  Subprocesses can do one or the other -- they can copy arbitrary data, and if you warn them ahead of time when you allocate the buffer, they can do real zero-copy shared memory.  But the combination is really difficult.

It'd be one thing if this were like a key feature that gave subinterpreters an advantage over subprocesses, but it seems really unlikely to me that a library won't know ahead of time when it's filling in a buffer to be transferred, and if anything it seems like we'd rather not expose read-write shared mappings in any case.  It's extremely non-trivial to do right [1].

tl;dr: let's not rule out a useful implementation strategy based on a feature we don't actually need.

One alternative would be your option (3) -- you can put bytes in and get memoryviews out, and since bytes objects are immutable it's OK.

[1] https://en.wikipedia.org/wiki/Memory_model_(programming)
Handling an exception
---------------------

It would also be reasonable to simply not return any value/exception from run() at all, or maybe just a bool for whether there was an unhandled exception.  Any high level API is going to be injecting code on both sides of the interpreter boundary anyway, so it can do whatever exception and traceback translation it wants to.
So any more detailed response would *have* to come back as a channel message?
That sounds like a reasonable option to me, too, especially since module level code doesn't have a return value as such - you can really only say "it raised an exception (and this was the exception it raised)" or "it reached the end of the code without raising an exception".
Given that, I think subprocess.run() (with check=False) is the right API precedent here: https://docs.python.org/3/library/subprocess.html#subprocess.run
That always returns subprocess.CompletedProcess, and then you can call "cp.check_returncode()" to get it to raise subprocess.CalledProcessError for non-zero return codes.
For interpreter.run(), we could keep the initial RunResult *really* simple and only report back:
* source: the source code passed to run()
* shared: the keyword args passed to run() (name chosen to match functools.partial)
* completed: completed execution without raising an exception? (True if yes, False otherwise)
Whether or not to report more details for a raised exception, and provide some mechanism to reraise it in the calling interpreter could then be deferred until later.
The subprocess.run() comparison does make me wonder whether this might be a more future-proof signature for Interpreter.run() though:
def run(source_str, /, *, channels=None): ...
That way channels can be a namespace *specifically* for passing in channels, and can be reported as such on RunResult. If we decide to allow arbitrary shared objects in the future, or add flag options like "reraise=True" to reraise exceptions from the subinterpreter in the current interpreter, we'd have that ability, rather than having the entire potential keyword namespace taken up for passing shared objects.
Would channels be a dict, or...?

-n

--
Nathaniel J. Smith -- https://vorpus.org
On 15 September 2017 at 12:04, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Sep 14, 2017 at 5:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The reason we're OK with this is that it means that only reading a new message from a channel (i.e. creating a cross-interpreter view) or discarding a previously read message (i.e. closing a cross-interpreter view) will be synchronisation points where the receiving interpreter necessarily needs to acquire the sending interpreter's GIL.
By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
Ah, that makes more sense.
I am nervous that allowing arbitrary memoryviews gives a *little* more power than we need or want. I like that the current API can reasonably be emulated using subprocesses -- it opens up the door for backports, compatibility support on language implementations that don't support subinterpreters, direct benchmark comparisons between the two implementation strategies, etc. But if we allow arbitrary memoryviews, then this requires that you can take (a) an arbitrary object, not specified ahead of time, and (b) provide two read-write views on it in separate interpreters such that modifications made in one are immediately visible in the other. Subprocesses can do one or the other -- they can copy arbitrary data, and if you warn them ahead of time when you allocate the buffer, they can do real zero-copy shared memory. But the combination is really difficult.
One constraint we'd want to impose is that the memory view in the receiving interpreter should always be read-only - while we don't currently expose the ability to request that at the Python layer, memoryviews *do* support the creation of read-only views at the C API layer (which then gets reported to Python code via the "view.readonly" attribute).

While that change alone is enough to preserve the simplex nature of the channel, it wouldn't be enough to prevent the *sender* from mutating the buffer contents and having that change be visible in the recipient.

In that regard it may make sense to maintain both restrictions initially (as you suggested below): only accept bytes on the sending side (to prevent mutation by the sender), and expose that as a read-only memory view on the receiving side (to allow for zero-copy data sharing without allowing mutation by the receiver).
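
What the Python layer does already expose is the flag itself; for example:

    assert memoryview(b'spam').readonly                 # views of bytes: read-only
    assert not memoryview(bytearray(b'spam')).readonly  # views of bytearray: writable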
It'd be one thing if this were like a key feature that gave subinterpreters an advantage over subprocesses, but it seems really unlikely to me that a library won't know ahead of time when it's filling in a buffer to be transferred, and if anything it seems like we'd rather not expose read-write shared mappings in any case. It's extremely non-trivial to do right [1].
tl;dr: let's not rule out a useful implementation strategy based on a feature we don't actually need.
Yeah, the description Eric currently has in the PEP is a summary of a much longer suggestion Yury, Neil Schemenauer and I put together while waiting for our flights following the core dev sprint, and the full version had some of these additional constraints on it (most notably the "read-only in the receiving interpreter" one).
One alternative would be your option (3) -- you can put bytes in and get memoryviews out, and since bytes objects are immutable it's OK.
Indeed, I think that will be a sensible starting point. However, I genuinely want to allow for zero-copy sharing of NumPy arrays eventually, as that's where I think this idea gets most interesting: the potential to allow for multiple parallel read operations on a given NumPy array *in Python* (rather than Cython or C) without running afoul of the GIL, and without needing to mess about with the complexities of operating system level IPC.
That way channels can be a namespace *specifically* for passing in channels, and can be reported as such on RunResult. If we decide to allow arbitrary shared objects in the future, or add flag options like "reraise=True" to reraise exceptions from the subinterpreter in the current interpreter, we'd have that ability, rather than having the entire potential keyword namespace taken up for passing shared objects.
Would channels be a dict, or...?
Yeah, it would be a direct replacement for the way the current draft is proposing to use the keywords dict - it would just be a separate dictionary instead. It does occur to me that if we wanted to align with the way the `runpy` module spells that concept, we'd call the option `init_globals`, but I'm thinking it will be better to only allow channels to be passed through directly, and require that everything else be sent through a channel. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
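Spelled out (a hypothetical sketch assuming the interpreters module proposed by the PEP; the "channels" keyword is only the spelling under discussion, not part of the current draft):

    import interpreters  # the stdlib module proposed by PEP 554

    interp = interpreters.create()
    recv, send = interpreters.create_channel()

    # Current draft: channel ends travel as arbitrary keyword arguments.
    interp.run("assert reader is not None", reader=recv)

    # Spelling discussed here: one dedicated dict for channel ends, leaving
    # the keyword namespace free for future options such as reraise=True.
    interp.run("assert reader is not None", channels={"reader": recv})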
On Thu, Sep 14, 2017 at 8:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Not really, because the only way to ensure object separation (i.e. no refcounted objects accessible from multiple interpreters at once) with a bytes-based API would be to either:

1. Always copy (eliminating most of the low overhead communications benefits that subinterpreters may offer over multiple processes)
2. Make the bytes implementation more complicated by allowing multiple bytes objects to share the same underlying storage while presenting as distinct objects in different interpreters
3. Make the output on the receiving side not actually a bytes object, but instead a view onto memory owned by another object in a different interpreter (a "memory view", one might say)
4. Pass Bytes through directly. The only problem of which I'm aware is that when Py_DECREF() triggers Bytes.__del__(), it happens in the current interpreter, which may not be the "owner" (i.e. the one that allocated the object). So the solution would be to make PyBytesType.tp_free() effectively run as a "pending call" under the owner. This would require two things:

1. a new PyBytesObject.owner field (PyInterpreterState *), or a separate owner table, which would be set when the object is passed through a channel
2. a Py_AddPendingCall() that targets a specific interpreter (which I expect would be desirable regardless)

Then, when the object has an owner, PyBytesType.tp_free() would add a pending call on the owner to call PyObject_Del() on the Bytes object. The catch is that currently "pending" calls (via Py_AddPendingCall) are run only in the main thread of the main interpreter. We'd need a similar mechanism that targets a specific interpreter.
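The per-interpreter "pending call" idea can be pictured with a plain work queue per owner; this is a pure-Python analogy (threads standing in for interpreters), not the C mechanism:

    import queue
    import threading

    pending = {"interp_A": queue.Queue()}  # one pending-call queue per owner

    def owner_loop(name):
        # The owner drains its own queue, so deferred frees always run
        # "under" the owner rather than wherever the last DECREF happened.
        while True:
            call = pending[name].get()
            if call is None:
                break
            call()

    owner = threading.Thread(target=owner_loop, args=("interp_A",))
    owner.start()

    # Another interpreter/thread drops the last reference: instead of
    # freeing locally, it schedules the actual free with the owner.
    pending["interp_A"].put(lambda: print("freed under the owner"))
    pending["interp_A"].put(None)  # shut the loop down for the demo
    owner.join()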
By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
There shouldn't be a need to synchronize on INCREF. If both interpreters have at least 1 reference then either one adding a reference shouldn't be a problem. If only one interpreter has a reference then the other won't be adding any references. If neither has a reference then neither is going to add any references. Perhaps I've missed something. Under what circumstances would INCREF happen while the refcount is 0?

On DECREF there shouldn't be a problem except possibly with a small race between decrementing the refcount and checking for a refcount of 0. We could address that several different ways, including allowing the pending call to get queued only once (or being a noop the second time).

FWIW, I'm not opposed to the CIV/memoryview approach, but want to make sure we really can't use Bytes before going down that route.

-eric
On Mon, Oct 2, 2017 at 9:31 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On DECREF there shouldn't be a problem except possibly with a small race between decrementing the refcount and checking for a refcount of 0. We could address that several different ways, including allowing the pending call to get queued only once (or being a noop the second time).
Alternately, the channel could own a reference and DECREF it in the owning interpreter once the refcount reaches 1. -eric
On 3 October 2017 at 11:31, Eric Snow <ericsnowcurrently@gmail.com> wrote:
There shouldn't be a need to synchronize on INCREF. If both interpreters have at least 1 reference then either one adding a reference shouldn't be a problem. If only one interpreter has a reference then the other won't be adding any references. If neither has a reference then neither is going to add any references. Perhaps I've missed something. Under what circumstances would INCREF happen while the refcount is 0?
The problem relates to the fact that there aren't any memory barriers around CPython's INCREF operations (they're implemented as an ordinary C post-increment operation), so you can get the following scenario:

* thread on CPU A has the sole reference (ob_refcnt=1)
* thread on CPU B acquires a new reference, but hasn't pushed the updated ob_refcnt value back to the shared memory cache yet
* original thread on CPU A drops its reference, *thinks* the refcnt is now zero, and deletes the object
* bad things now happen in CPU B as the thread running there tries to use a deleted object :)

The GIL currently protects us from this, as switching CPUs requires switching threads, which means the original thread has to release the GIL (flushing all of its state changes to the shared cache), and the new thread has to acquire it (hence refreshing its local cache from the shared one).

The need to switch all incref/decref operations over to using atomic thread-safe primitives when removing the GIL is one of the main reasons that attempting to remove the GIL *within* an interpreter is expensive (and why Larry et al are having to explore completely different ref count management strategies for the GILectomy).

By contrast, if you rely on a new memoryview variant to mediate all data sharing between interpreters, then you can make sure that *it* is using synchronisation primitives as needed to ensure the required cache coherency across different CPUs, without any negative impacts on regular single interpreter code (which can still rely on the cache coherency guarantees provided by the GIL).

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
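The lost-update hazard Nick describes is easy to reproduce in pure Python with an unsynchronised counter - a rough analogy for an unprotected ob_refcnt, not the C mechanism itself:

    import threading

    count = 0  # stand-in for an unprotected ob_refcnt

    def bump(n):
        global count
        for _ in range(n):
            count += 1  # read-modify-write, not atomic

    threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Expected 400000; without a lock around the increment, the result is
    # usually lower because concurrent updates overwrite each other.
    print(count)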
On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem relates to the fact that there aren't any memory barriers around CPython's INCREF operations (they're implemented as an ordinary C post-increment operation), so you can get the following scenario:
* thread on CPU A has the sole reference (ob_refcnt=1)
* thread on CPU B acquires a new reference, but hasn't pushed the updated ob_refcnt value back to the shared memory cache yet
* original thread on CPU A drops its reference, *thinks* the refcnt is now zero, and deletes the object
* bad things now happen in CPU B as the thread running there tries to use a deleted object :)
I'm not clear on where we'd run into this problem with channels. Mirroring your scenario:

* interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
* interp A sends the object to the channel
* interp B (in thread on CPU B) receives the object from the channel
* the new reference is held until interp B DECREFs the object

From what I see, at no point do we get a refcount of 0, such that there would be a race on the object being deleted.

The only problem I'm aware of (it dawned on me last night), is in the case that the interpreter that created the object gets deleted before the object does. In that case we can't pass the deletion back to the original interpreter. (I don't think this problem is necessarily exclusive to the solution I've proposed for Bytes.)

-eric
On Wed, Oct 4, 2017 at 4:51 PM, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem relates to the fact that there aren't any memory barriers around CPython's INCREF operations (they're implemented as an ordinary C post-increment operation), so you can get the following scenario:
* thread on CPU A has the sole reference (ob_refcnt=1)
* thread on CPU B acquires a new reference, but hasn't pushed the updated ob_refcnt value back to the shared memory cache yet
* original thread on CPU A drops its reference, *thinks* the refcnt is now zero, and deletes the object
* bad things now happen in CPU B as the thread running there tries to use a deleted object :)
I'm not clear on where we'd run into this problem with channels. Mirroring your scenario:
* interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
* interp A sends the object to the channel
* interp B (in thread on CPU B) receives the object from the channel
* the new reference is held until interp B DECREFs the object
From what I see, at no point do we get a refcount of 0, such that there would be a race on the object being deleted.
So what you're saying is that when Larry finishes the gilectomy, subinterpreters will work GIL-free too?-)

––Koos
-- + Koos Zevenhoven + http://twitter.com/k7hoven +
On 4 October 2017 at 23:51, Eric Snow <ericsnowcurrently@gmail.com> wrote:
On Tue, Oct 3, 2017 at 11:36 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The problem relates to the fact that there aren't any memory barriers around CPython's INCREF operations (they're implemented as an ordinary C post-increment operation), so you can get the following scenario:
* thread on CPU A has the sole reference (ob_refcnt=1)
* thread on CPU B acquires a new reference, but hasn't pushed the updated ob_refcnt value back to the shared memory cache yet
* original thread on CPU A drops its reference, *thinks* the refcnt is now zero, and deletes the object
* bad things now happen in CPU B as the thread running there tries to use a deleted object :)
I'm not clear on where we'd run into this problem with channels. Mirroring your scenario:
* interpreter A (in thread on CPU A) INCREFs the object (the GIL is still held)
* interp A sends the object to the channel
* interp B (in thread on CPU B) receives the object from the channel
* the new reference is held until interp B DECREFs the object
From what I see, at no point do we get a refcount of 0, such that there would be a race on the object being deleted.
Having the sending interpreter do the INCREF just changes the problem to be a memory leak waiting to happen rather than an access-after-free issue, since the problematic non-synchronised scenario then becomes:

* thread on CPU A has two references (ob_refcnt=2)
* it sends a reference to a thread on CPU B via a channel
* thread on CPU A releases its reference (ob_refcnt=1)
* updated ob_refcnt value hasn't made it back to the shared memory cache yet
* thread on CPU B releases its reference (ob_refcnt=1)
* both threads have released their reference, but the refcnt is still 1 -> object leaks!

We simply can't have INCREFs and DECREFs happening in different threads without some way of ensuring cache coherency for *both* operations - otherwise we risk either the refcount going to zero when it shouldn't, or *not* going to zero when it should.

The current CPython implementation relies on the process global GIL for that purpose, so none of these problems will show up until you start trying to replace that with per-interpreter locks. Free threaded reference counting relies on (expensive) atomic increments & decrements.

The cross-interpreter view proposal aims to allow per-interpreter GILs without introducing atomic increments & decrements by instead relying on the view itself to ensure that it's holding the right GIL for the object whose refcount it's manipulating, and the receiving interpreter explicitly closing the view when it's done with it.

So while CIVs wouldn't be as easy to use as regular object references:

1. They'd be no harder to use than memoryviews in general
2. They'd structurally ensure that regular object refcounts can still rely on "protected by the GIL" semantics
3. They'd structurally ensure zero performance degradation for regular object refcounts
4. By virtue of being memoryview based, they'd encourage the adoption of interfaces and practices that can be adapted to multiple processes through the use of techniques like shared memory regions and memory mapped files (see http://www.boost.org/doc/libs/1_54_0/doc/html/interprocess/sharedmemorybetwe... for some detailed explanations of how that works, and https://arrow.apache.org/ for an example of ways tools like Pandas can use that to enable zero-copy data sharing)
The only problem I'm aware of (it dawned on me last night), is in the case that the interpreter that created the object gets deleted before the object does. In that case we can't pass the deletion back to the original interpreter. (I don't think this problem is necessarily exclusive to the solution I've proposed for Bytes.)
The cross-interpreter-view idea proposes to deal with that by having the CIV hold a strong reference not only to the sending object (which is already part of the regular memoryview semantics), but *also* to the sending interpreter - that way, neither the sending object nor the sending interpreter can go away until the receiving interpreter closes the view.

The refcount-integrity-ensuring sequence of events becomes:

1. Sending interpreter submits the object to the channel
2. Channel creates a CIV with references to the sending interpreter & sending object, and a view on the sending object's memory
3. Receiving interpreter gets the CIV from the channel
4. Receiving interpreter closes the CIV either explicitly or via __del__ (the latter would emit ResourceWarning)
5. CIV switches execution back to the sending interpreter and releases both the memory buffer and the reference to the sending object
6. CIV switches execution back to the receiving interpreter, and releases its reference to the sending interpreter
7. Execution continues in the receiving interpreter

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
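A minimal pure-Python sketch of that lifecycle, with the class name invented for illustration and the interpreter switches reduced to comments:

    import warnings

    class CrossInterpreterView:
        # Hypothetical sketch of the sequence above; the real thing would
        # actually switch interpreter states where the comments say so.

        def __init__(self, obj, sending_interp):
            self._obj = obj                # strong ref to the sending object
            self._interp = sending_interp  # strong ref to the sending interpreter
            self._view = memoryview(obj)   # view on the sending object's memory
            self._closed = False

        def close(self):
            if self._closed:
                return
            self._closed = True
            self._view.release()
            # ... switch execution to the sending interpreter here ...
            self._obj = None      # release the reference to the sending object
            # ... switch back to the receiving interpreter here ...
            self._interp = None   # release the reference to the sending interpreter

        def __del__(self):
            if not self._closed:
                warnings.warn(ResourceWarning("unclosed cross-interpreter view"))
                self.close()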
On Tue, Oct 3, 2017 at 8:55 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I think we need a sharing protocol, not just a flag. We also need to think carefully about that protocol, so that it does not imply unnecessary memory copies. Therefore I think the protocol should be something like the buffer protocol, that allows to acquire and release a set of shared memory areas, but without imposing any semantics onto those memory areas (each type implementing its own semantics). And there needs to be a dedicated reference counting for object shares, so that the original object can be notified when all its shares have vanished.
I've come to agree. :) I actually came to the same conclusion tonight before I'd been able to read through your message carefully. My idea is below. Your suggestion about protecting shared memory areas is something to discuss further, though I'm not sure it's strictly necessary yet (before we stop sharing the GIL).

On Wed, Oct 4, 2017 at 7:41 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Having the sending interpreter do the INCREF just changes the problem to be a memory leak waiting to happen rather than an access-after-free issue, since the problematic non-synchronised scenario then becomes:
* thread on CPU A has two references (ob_refcnt=2)
* it sends a reference to a thread on CPU B via a channel
* thread on CPU A releases its reference (ob_refcnt=1)
* updated ob_refcnt value hasn't made it back to the shared memory cache yet
* thread on CPU B releases its reference (ob_refcnt=1)
* both threads have released their reference, but the refcnt is still 1 -> object leaks!
We simply can't have INCREFs and DECREFs happening in different threads without some way of ensuring cache coherency for *both* operations - otherwise we risk either the refcount going to zero when it shouldn't, or *not* going to zero when it should.
The current CPython implementation relies on the process global GIL for that purpose, so none of these problems will show up until you start trying to replace that with per-interpreter locks.
Free threaded reference counting relies on (expensive) atomic increments & decrements.
Right. I'm not sure why I was missing that, but I'm clear now. Below is a rough idea of what I think may work instead (the result of much tossing and turning in bed*).

While we're still sharing a GIL between interpreters:

    Channel.send(obj):  # in interp A
        incref(obj)
        if type(obj).tp_share == NULL:
            raise ValueError("not a shareable type")
        ch.objects.append(obj)

    Channel.recv():  # in interp B
        orig = ch.objects.pop(0)
        obj = orig.tp_share()
        return obj

    bytes.tp_share():
        return self

After we move to not sharing the GIL between interpreters:

    Channel.send(obj):  # in interp A
        incref(obj)
        if type(obj).tp_share == NULL:
            raise ValueError("not a shareable type")
        set_owner(obj)  # obj.owner or add an obj -> interp entry to global table
        ch.objects.append(obj)

    Channel.recv():  # in interp B
        orig = ch.objects.pop(0)
        obj = orig.tp_share()
        set_shared(obj, orig)  # add to a global table
        return obj

    bytes.tp_share():
        obj = blank_bytes(len(self))
        obj.ob_sval = self.ob_sval  # hand-wavy memory sharing
        return obj

    bytes.tp_free():  # under no-shared-GIL:
        # most of this could be pulled into a macro for re-use
        orig = lookup_shared(self)
        if orig != NULL:
            current = release_LIL()
            interp = lookup_owner(orig)
            acquire_LIL(interp)
            decref(orig)
            release_LIL(interp)
            acquire_LIL(current)
            # clear shared/owner tables
            # clear/release self.ob_sval
        free(self)

The CIV approach could be facilitated through something like a new SharedBuffer type, or through a separate BufferViewChannel, etc.

Most notably, this approach avoids hard-coding specific type support into channels and should work out fine under no-shared-GIL subinterpreters. One nice thing about the tp_share slot is that it makes it much easier (along with C-API for managing the global owned/shared tables) to implement other types that are legal to pass through channels. Such could be provided via extension modules. Numpy arrays could be made to support it, if that's your thing. Antoine could give tp_share to locks and semaphores. :) Of course, any such types would have to ensure that they are actually safe to share between interpreters without a GIL between them...

For PEP 554, I'd only propose the tp_share slot and its use in Channel.send()/.recv(). The parts related to global tables and memory sharing and tp_free() wouldn't be necessary until we stop sharing the GIL between interpreters. However, I believe that tp_share would make us ready for that.

-eric

* I should know by now that some ideas sound better in the middle of the night than they do the next day, but this idea is keeping me awake so I'll risk it! :)
On 5 October 2017 at 18:45, Eric Snow <ericsnowcurrently@gmail.com> wrote:
After we move to not sharing the GIL between interpreters:
    Channel.send(obj):  # in interp A
        incref(obj)
        if type(obj).tp_share == NULL:
            raise ValueError("not a shareable type")
        set_owner(obj)  # obj.owner or add an obj -> interp entry to global table
        ch.objects.append(obj)

    Channel.recv():  # in interp B
        orig = ch.objects.pop(0)
        obj = orig.tp_share()
        set_shared(obj, orig)  # add to a global table
        return obj
This would be hard to get to work reliably, because "orig.tp_share()" would be running in the receiving interpreter, but all the attributes of "orig" would have been allocated by the sending interpreter. It gets more reliable if it's *Channel.send* that calls tp_share() though, but moving the call to the sending side makes it clear that a tp_share protocol would still need to rely on a more primitive set of "shareable objects" that were the permitted return values from the tp_share call. And that's the real pay-off that comes from defining this in terms of the memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only a regular C struct that gets passed across the interpreter boundary (the reference to the original objects gets carried along passively as part of the CIV - it never gets *used* in the receiving interpreter).
    bytes.tp_share():
        obj = blank_bytes(len(self))
        obj.ob_sval = self.ob_sval  # hand-wavy memory sharing
        return obj
This is effectively reinventing memoryview, while trying to pretend it's an ordinary bytes object. Don't reinvent memoryview :)
    bytes.tp_free():  # under no-shared-GIL:
        # most of this could be pulled into a macro for re-use
        orig = lookup_shared(self)
        if orig != NULL:
            current = release_LIL()
            interp = lookup_owner(orig)
            acquire_LIL(interp)
            decref(orig)
            release_LIL(interp)
            acquire_LIL(current)
            # clear shared/owner tables
            # clear/release self.ob_sval
        free(self)
I don't think we should be touching the behaviour of core builtins solely to enable message passing to subinterpreters without a shared GIL. The simplest possible variant of CIVs that I can think of would be able to avoid that outcome by being a memoryview subclass, since they just need to hold the extra reference to the original interpreter, and include some logic to switch interpreters at the appropriate time.

That said, I think there's definitely a useful design question to ask in this area, not about bytes (which can be readily represented by a memoryview variant in the receiving interpreter), but about *strings*: they have a more complex internal layout than bytes objects, but as long as the receiving interpreter can make sure that the original string continues to exist, then you could usefully implement a "strview" type to avoid having to go through an encode/decode cycle just to pass a string to another subinterpreter.

That would provide a reasonably compelling argument that CIVs *shouldn't* be implemented as memoryview subclasses, but instead defined as *containing* a managed view of an object owned by a different interpreter. That way, even if the initial implementation only supported CIVs that contained a memoryview instance, we'd have the freedom to define other kinds of views later (such as strview), while being able to reuse the same CIV machinery.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, Oct 5, 2017 at 4:57 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
This would be hard to get to work reliably, because "orig.tp_share()" would be running in the receiving interpreter, but all the attributes of "orig" would have been allocated by the sending interpreter. It gets more reliable if it's *Channel.send* that calls tp_share() though, but moving the call to the sending side makes it clear that a tp_share protocol would still need to rely on a more primitive set of "shareable objects" that were the permitted return values from the tp_share call.
The point of running tp_share() in the receiving interpreter is to force allocation under that interpreter, so that GC applies there. I agree that you basically can't do anything in tp_share() that would affect the sending interpreter, including INCREF and DECREF. Since we INCREFed in send(), we know that the we have a safe reference, so we don't have to worry about that part in tp_share(). We would only be able to do low-level things (like the buffer protocol) that don't interact with the original object's interpreter. Given that this is a quite low-level tp slot and low-level functionality, I'd expect that a sufficiently clear entry (i.e. warning) in the docs would be enough for the few that dare. <wink>
From my perspective adding the tp_share slot allows for much more experimentation with object sharing (right now, long before we get to considering how to stop sharing the GIL) by us *and* third parties. None of the alternatives seem to offer the same opportunity while still working out *after* we stop sharing the GIL.
And that's the real pay-off that comes from defining this in terms of the memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only a regular C struct that gets passed across the interpreter boundary (the reference to the original objects gets carried along passively as part of the CIV - it never gets *used* in the receiving interpreter).
Yeah, the (PEP 3118) buffer protocol offers precedent in a number of ways that are applicable to channels here. I'm simply reticent to lock PEP 554 into such a specific solution as the buffer-specific CIV. I'm trying to accommodate anticipated future needs while keeping the PEP as simple and basic as possible. It's driving me nuts! :P Things were *much* simpler before I added Channels to the PEP. :)
    bytes.tp_share():
        obj = blank_bytes(len(self))
        obj.ob_sval = self.ob_sval  # hand-wavy memory sharing
        return obj
This is effectively reinventing memoryview, while trying to pretend it's an ordinary bytes object. Don't reinvent memoryview :)
    bytes.tp_free():  # under no-shared-GIL:
        # most of this could be pulled into a macro for re-use
        orig = lookup_shared(self)
        if orig != NULL:
            current = release_LIL()
            interp = lookup_owner(orig)
            acquire_LIL(interp)
            decref(orig)
            release_LIL(interp)
            acquire_LIL(current)
            # clear shared/owner tables
            # clear/release self.ob_sval
        free(self)
I don't think we should be touching the behaviour of core builtins solely to enable message passing to subinterpreters without a shared GIL.
Keep in mind that I included the above as a possible solution using tp_share() that would work *after* we stop sharing the GIL. My point is that with tp_share() we have a solution that works now *and* will work later. I don't care how we use tp_share to do so. :) I long to be able to say in the PEP that you can pass bytes through the channel and get bytes on the other side.

That said, I'm not sure how this could be made to work without involving tp_free(). If that is really off the table (even in the simplest possible ways) then I don't think there is a way to actually share objects of builtin types between interpreters other than through views like CIV. We could still support tp_share() for the sake of third parties, which would facilitate that simplicity I was aiming for in sending data between interpreters, as well as leaving the door open for nearly all the same experimentation. However, I expect that most *uses* of channels will involve builtin types, particularly as we start off, so having to rely on view types for builtins would add not-insignificant awkwardness to using channels. I'd still like to avoid that if possible, so let's not rush to completely close the door on small modifications to tp_free for builtins. :)

Regardless, I still (after a night's rest and a day of not thinking about it) consider tp_share() to be the solution I'd been hoping we'd find, whether or not we can apply it to builtin types.
The simplest possible variant of CIVs that I can think of would be able to avoid that outcome by being a memoryview subclass, since they just need to hold the extra reference to the original interpreter, and include some logic to switch interpreters at the appropriate time.
That said, I think there's definitely a useful design question to ask in this area, not about bytes (which can be readily represented by a memoryview variant in the receiving interpreter), but about *strings*: they have a more complex internal layout than bytes objects, but as long as the receiving interpreter can make sure that the original string continues to exist, then you could usefully implement a "strview" type to avoid having to go through an encode/decode cycle just to pass a string to another subinterpreter.
That would provide a reasonably compelling argument that CIVs *shouldn't* be implemented as memoryview subclasses, but instead defined as *containing* a managed view of an object owned by a different interpreter.
That way, even if the initial implementation only supported CIVs that contained a memoryview instance, we'd have the freedom to define other kinds of views later (such as strview), while being able to reuse the same CIV machinery.
Hmm, so a CIV implementation that accomplishes something similar to tp_share()? For some reason I'm seeing similarities between CIV-vs.-tp_share and the import machinery before PEP 451. Before we added module specs, import hook authors had to do a bunch of the busy work that the import machinery does for you now by leveraging module specs. Back then we worked to provide a number of helpers to reduce that extra pain of writing an import hook. Now the helpers are irrelevant and the extra burden is gone. My mind is drawn to the comparison between that and the question of CIV vs. tp_share(). CIV would be more like the post-451 import world, where I expect the CIV would take care of the data sharing operations. That said, the situation in PEP 554 is sufficiently different that I'm not convinced a generic CIV protocol would be better. I'm not sure how much CIV could do for you over helpers+tp_share.

Anyway, here are the leading approaches that I'm looking at now:

* adding a tp_share slot
  + you send() the object directly and recv() the object coming out of tp_share() (which will probably be the same type as the original)
  + this would eventually require small changes in tp_free for participating types
  + we would likely provide helpers (eventually), similar to the new buffer protocol, to make it easier to manage sharing data
* simulating tp_share via an external global registry (or a registry on the Channel type)
  + it would still be hard to make work without hooking into tp_free()
* CIVs hard-coded in Channel (or BufferViewChannel, etc.) for specific types (e.g. buffers)
  + you send() the object like normal, but recv() the view
* a CIV protocol on Channel by which you can add support for more types
  + you send() the object like normal but recv() the view
  + could work through subclassing or a registry
  + a lot of conceptual similarity with tp_share+tp_free
* a CIV-like proxy
  + you wrap the object, send() the proxy, and recv() a proxy
  + this is entirely compatible with tp_share()

Here are what I consider the key metrics relative to the utility of a solution (not in any significant order):

* how hard to understand as a Python programmer?
* how much extra work (if any) for folks calling Channel.send()?
* how much extra work (if any) for folks calling Channel.recv()?
* how complex is the CPython implementation?
* how hard to understand as a type author (wanting to add support for their type)?
* how hard to add support for a new type?
* what variety of types could be supported?
* what breadth of experimentation opens up?

The most important thing to me is keeping things simple for Python programmers. After that is ease-of-use for type authors. However, I also want to put us in a good position in 3.7 to experiment extensively with subinterpreters, so that's a big consideration.

Consequently, for PEP 554 my goal is to find a solution for object sharing that keeps things simple in Python while laying a basic foundation we can build on at the C level, so we don't get locked in but still maximize our opportunities to experiment. :)

-eric
On 6 October 2017 at 11:48, Eric Snow <ericsnowcurrently@gmail.com> wrote:
And that's the real pay-off that comes from defining this in terms of the memoryview protocol: Py_buffer structs *aren't* Python objects, so it's only a regular C struct that gets passed across the interpreter boundary (the reference to the original objects gets carried along passively as part of the CIV - it never gets *used* in the receiving interpreter).
Yeah, the (PEP 3118) buffer protocol offers precedent in a number of ways that are applicable to channels here. I'm simply reticent to lock PEP 554 into such a specific solution as the buffer-specific CIV. I'm trying to accommodate anticipated future needs while keeping the PEP as simple and basic as possible. It's driving me nuts! :P Things were *much* simpler before I added Channels to the PEP. :)
Starting with memory-sharing only doesn't lock us into anything, since you can still add a more flexible kind of channel based on a different protocol later if it turns out that memory sharing isn't enough. By contrast, if you make the initial channel semantics incompatible with multiprocessing by design, you *will* prevent anyone from experimenting with replicating the shared memory based channel API for communicating between processes :) That said, if you'd prefer to keep the "Channel" name available for the possible introduction of object channels at a later date, you could call the initial memoryview based channel a "MemChannel".
I don't think we should be touching the behaviour of core builtins solely to enable message passing to subinterpreters without a shared GIL.
Keep in mind that I included the above as a possible solution using tp_share() that would work *after* we stop sharing the GIL. My point is that with tp_share() we have a solution that works now *and* will work later. I don't care how we use tp_share to do so. :) I long to be able to say in the PEP that you can pass bytes through the channel and get bytes on the other side.
Memory views are a builtin type as well, and they emphasise the practical benefit we're trying to get relative to typical multiprocessing arrangements: zero-copy data sharing.

So here's my proposed experimentation-enabling development strategy:

1. Start out with a MemChannel API, that accepts any buffer-exporting object as input, and outputs only a cross-interpreter memoryview subclass
2. Use that as the basis for the work to get to a per-interpreter locking arrangement that allows subinterpreters to fully exploit multiple CPUs
3. Only then try to design a Channel API that allows for sharing builtin immutable objects between interpreters (bytes, strings, numbers), at a time when you can be certain you won't be inadvertently making it harder to make the GIL a truly per-interpreter lock, rather than the current process global runtime lock.

The key benefit of this approach is that we *know* MemChannel can work: the buffer protocol already operates at the level of C structs and pointers, not Python objects, and there are already plenty of interesting buffer-protocol-supporting objects around, so as long as the CIV switches interpreters at the right time, there aren't any fundamentally new runtime level capabilities needed to implement it.

The lower level MemChannel API could then also be replicated for multiprocessing, while the higher level more speculative object-based Channel API would be specific to subinterpreters (and probably only ever designed and implemented if you first succeed in making subinterpreters sufficiently independent that they don't rely on a process-wide GIL any more).

So I'm not saying "Never design an object-sharing protocol specifically for use with subinterpreters". I'm saying "You don't have a demonstrated need for that yet, so don't try to define it until you do".
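As code, step 1 might look something like this (every name here is hypothetical, extrapolated from the PEP's create_channel() draft; MemChannel is only a suggested name in this thread, not a real API):

    import interpreters  # the module proposed by PEP 554

    recv, send = interpreters.create_mem_channel()  # hypothetical constructor

    send.send(b"payload")  # any buffer-exporting object can go in
    view = recv.recv()     # a cross-interpreter memoryview subclass comes out
    assert view.readonly   # receivers cannot mutate the sender's data
    data = view.tobytes()  # copy only if/when a real copy is needed
    view.release()         # switches back to the sender to release the buffer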
My mind is drawn to the comparison between that and the question of CIV vs. tp_share(). CIV would be more like the post-451 import world, where I expect the CIV would take care of the data sharing operations. That said, the situation in PEP 554 is sufficiently different that I'm not convinced a generic CIV protocol would be better. I'm not sure how much CIV could do for you over helpers+tp_share.
Anyway, here are the leading approaches that I'm looking at now:
* adding a tp_share slot
  + you send() the object directly and recv() the object coming out of tp_share() (which will probably be the same type as the original)
  + this would eventually require small changes in tp_free for participating types
  + we would likely provide helpers (eventually), similar to the new buffer protocol, to make it easier to manage sharing data
I'm skeptical about this approach because you'll be designing in a vacuum against future possible constraints that you can't test yet: the inherent complexity in the object sharing protocol will come from *not* having a process-wide GIL, but you'll be starting out with a process-wide GIL still in place. And that means third parties will inevitably rely on the process-wide GIL in their tp_share implementations (despite their best intentions), and you'll end up with the same issue that causes problems for the rest of the C API.

By contrast, if you delay this step until *after* the GIL has successfully been shifted to being per-interpreter, then by the time the new protocol is defined, people will also be able to test their tp_share implementations properly.

At that point, you'd also presumably have evidence of demand to justify the introduction of a new core language protocol, as:

* folks will only complain about the limitations of MemChannel if they're actually using subinterpreters
* the complaints about the limitations of MemChannel would help guide the object sharing protocol design
* simulating tp_share via an external global registry (or a registry on the Channel type)
  + it would still be hard to make work without hooking into tp_free()
* CIVs hard-coded in Channel (or BufferViewChannel, etc.) for specific types (e.g. buffers)
  + you send() the object like normal, but recv() the view
* a CIV protocol on Channel by which you can add support for more types
  + you send() the object like normal but recv() the view
  + could work through subclassing or a registry
  + a lot of conceptual similarity with tp_share+tp_free
* a CIV-like proxy
  + you wrap the object, send() the proxy, and recv() a proxy
  + this is entirely compatible with tp_share()
* Allow for multiple channel types, such that MemChannel is merely the *first* channel type, rather than the *only* channel type
  + Allows PEP 554 to be restricted to things we already know can be made to work
  + Doesn't block the introduction of an object-sharing based Channel in some future release
  + Allows for at least some channel types to be adapted for use with shared memory and multiprocessing
Here are what I consider the key metrics relative to the utility of a solution (not in any significant order):
* how hard to understand as a Python programmer?
Not especially important yet - this is more a criterion for the final API, not the initial experimental platform.
* how much extra work (if any) for folks calling Channel.send()?
* how much extra work (if any) for folks calling Channel.recv()?
I don't think either are particularly important yet, although we also don't want to raise any pointless barriers to experimentation.
* how complex is the CPython implementation?
This is critical, since we want to minimise any potential for undesirable side effects on regular single interpreter code.
* how hard to understand as a type author (wanting to add support for their type)?
* how hard to add support for a new type?
* what variety of types could be supported?
* what breadth of experimentation opens up?
You missed the big one: what risk does the initial channel design pose to the underlying objective of making the GIL a genuinely per-interpreter lock?

If we don't eventually reach the latter goal, then subinterpreters won't really offer much in the way of compelling benefits over just using a thread pool and queue.Queue.

MemChannel poses zero additional risk to that, since we wouldn't be sharing actual Python objects between interpreters, only C pointers and structs.

By contrast, introducing an object channel early poses significant new risks to that goal, since it will force you to solve hard protocol design and refcount management problems *before* making the switch, rather than being able to defer the design of the object channel protocol until *after* you've already enabled the ability to run subinterpreters in completely independent threads.
The most important thing to me is keeping things simple for Python programmers. After that is ease-of-use for type authors. However, I also want to put us in a good position in 3.7 to experiment extensively with subinterpreters, so that's a big consideration.
Consequently, for PEP 554 my goal is to find a solution for object sharing that keeps things simple in Python while laying a basic foundation we can build on at the C level, so we don't get locked in but still maximize our opportunities to experiment. :)
I think our priorities are quite different then, as I believe PEP 554 should be focused on defining a relatively easy to implement API that nevertheless makes it possible to write interesting programs while working on the goal of making the GIL per-interpreter, without worrying too much about whether or not the initial cross-interpreter communication channels closely resemble the final ones that will be intended for more general use. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
While I'm actually trying not to say much here so that I can avoid this discussion now, here's just a couple of ideas and thoughts from me at this point:

(A) Instead of sending bytes and receiving memoryviews, one could consider sending *and* receiving memoryviews for now. That could then be extended into more types of objects in the future without changing the basic concept of the channel. Probably, the memoryview would need to be copied (but not the data of course). But I'm guessing copying a memoryview would be quite fast. This would hopefully require fewer API changes or additions in the future. OTOH, giving it a different name like MemChannel or making it 3rd party will buy some more time to figure out the right API. But maybe that's not needed.

(B) We would probably then like to pretend that the object coming out the other end of a Channel *is* the original object. As long as these channels are the only way to directly pass objects between interpreters, there are essentially only two ways to tell the difference (AFAICT):

1. Calling id(...) and sending it over to the other interpreter and checking if it's the same.
2. When the same object is sent twice to the same interpreter. Then one can compare the two with id(...) or using the `is` operator.

There are solutions to the problems too:

1. Send the id() from the sending interpreter along with the sent object so that the receiving interpreter can somehow attach it to the object and then return it from id(...).
2. When an object is received, make a lookup in an interpreter-wide cache to see if an object by this id has already been received. If yes, take that one.

Now it should essentially look like the received object is really "the same one" as in the sending interpreter. This should also work with multiple interpreters and multiple channels, as long as the id is always preserved.

(C) One further complication regarding memoryview in general is that .release() should probably be propagated to the sending interpreter somehow.

(D) I think someone already mentioned this one, but would it not be better to start a new interpreter in the background in a new thread by default? I think this would make things simpler and leave more freedom regarding the implementation in the future. If you need to run an interpreter within the current thread, you could perhaps optionally do that too.

––Koos

PS. I have lots of thoughts related to this, but I can't afford to engage in them now. (Anyway, it's probably more urgent to get some stuff with PEP 555 and its spin-off thoughts out of the way).
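Idea (B)2 might look roughly like this on the receiving side (a hypothetical helper; a recv() that returns the sender's id() alongside the object is an assumption invented purely for this illustration):

    _received = {}  # interpreter-wide cache: sender id -> local object

    def recv_preserving_identity(channel):
        sender_id, obj = channel.recv()  # assumed to carry the sender's id()
        # Re-use the previously received object for this id, so repeated
        # sends look like "the same" object on this side.
        return _received.setdefault(sender_id, obj)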
On 7 October 2017 at 02:29, Koos Zevenhoven <k7hoven@gmail.com> wrote:
While I'm actually trying not to say much here so that I can avoid this discussion now, here's just a couple of ideas and thoughts from me at this point:
(A) Instead of sending bytes and receiving memoryviews, one could consider sending *and* receiving memoryviews for now. That could then be extended into more types of objects in the future without changing the basic concept of the channel. Probably, the memoryview would need to be copied (but not the data of course). But I'm guessing copying a memoryview would be quite fast.
The proposal is to allow sending any buffer-exporting object, so sending a memoryview would be supported.
This would hopefully require less API changes or additions in the future. OTOH, giving it a different name like MemChannel or making it 3rd party will buy some more time to figure out the right API. But maybe that's not needed.
I think having both a memory-centric data channel and an object-centric data channel would be useful long term, so I don't see a lot of downsides to starting with the easier-to-implement MemChannel, and then looking at how to define a plain Channel later.

For example, it occurs to me that the closest current equivalent we have to an object level counterpart to the memory buffer protocol would be the weak reference protocol, wherein a multi-interpreter-aware proxy object could actually take care of switching interpreters as needed when manipulating reference counts. While weakrefs themselves wouldn't be usable in the general case (many builtin types don't support weak references, and we'd want to support strong cross-interpreter references anyway), a wrapt-style object proxy would provide us with a way to maintain a single strong reference to the original object in its originating interpreter (implicitly switching to that interpreter as needed), while also maintaining a regular local reference count on the proxy object in the receiving interpreter.

And here's the neat thing: since subinterpreters share an address space, it would be possible to experiment with an object-proxy based channel by passing object pointers over a memoryview based channel.
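That last experiment can even be previewed in a single interpreter today, since an object pointer is just bytes once you write it down. This is only an illustration of the idea - it is safe here solely because obj stays alive in the same scope; a real channel would need the refcount management discussed above:

    import ctypes

    obj = ["any", "object"]  # stays alive because we hold this reference

    # "Send": the pointer itself is just bytes a memory channel could carry.
    payload = id(obj).to_bytes(8, "little")  # assumes a 64-bit build

    # "Receive": reconstruct an object reference from the address.
    addr = int.from_bytes(payload, "little")
    same = ctypes.cast(addr, ctypes.py_object).value
    assert same is obj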
(B) We would probably then like to pretend that the object coming out the other end of a Channel *is* the original object. As long as these channels are the only way to directly pass objects between interpreters, there are essentially only two ways to tell the difference (AFAICT):
1. Calling id(...) and sending it over to the other interpreter and checking if it's the same.
2. When the same object is sent twice to the same interpreter. Then one can compare the two with id(...) or using the `is` operator.
There are solutions to the problems too:
1. Send the id() from the sending interpreter along with the sent object so that the receiving interpreter can somehow attach it to the object and then return it from id(...).
2. When an object is received, make a lookup in an interpreter-wide cache to see if an object by this id has already been received. If yes, take that one.
Now it should essentially look like the received object is really "the same one" as in the sending interpreter. This should also work with multiple interpreters and multiple channels, as long as the id is always preserved.
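A minimal sketch of the cache in (2), assuming (hypothetically) that a channel could carry (sender_id, payload) tuples rather than just bytes::

    # Interpreter-wide cache: the sender's id() -> locally received object.
    _received = {}

    def recv_preserving_identity(recv, materialize):
        sender_id, payload = recv.recv()
        try:
            return _received[sender_id]
        except KeyError:
            obj = _received[sender_id] = materialize(payload)
            return obj

Note that the cache keeps every received object alive, so a real version would also need eviction tied to the sender's lifetime.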
I don't personally think we want to expend much (if any) effort on presenting the illusion that the objects on either end of the channel are the "same" object, but postponing the question entirely is also one of the benefits I see to starting with MemChannel, and leaving the object-centric Channel until later.
(C) One further complication regarding memoryview in general is that .release() should probably be propagated to the sending interpreter somehow.
Yep, switching interpreters when releasing the buffer is the main reason you couldn't use a regular memoryview for this purpose - you need a variant that holds a strong reference to the sending interpreter, and switches back to it for the buffer release operation.
(D) I think someone already mentioned this one, but would it not be better to start a new interpreter in the background in a new thread by default? I think this would make things simpler and leave more freedom regarding the implementation in the future. If you need to run an interpreter within the current thread, you could perhaps optionally do that too.
Not really, as that approach doesn't compose as well with existing thread management primitives like concurrent.futures.ThreadPoolExecutor. It also doesn't match the way the existing subinterpreter machinery works, where threads can change their active interpreter. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
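For example, because run() executes in whichever thread calls it, it composes directly with a thread pool. A sketch using the PEP's proposed (not yet existing) API::

    from concurrent.futures import ThreadPoolExecutor
    import interpreters  # the proposed module

    interp = interpreters.create()

    with ThreadPoolExecutor(max_workers=1) as pool:
        # The pool's worker thread runs the subinterpreter for the
        # duration of the call, leaving the current thread free.
        fut = pool.submit(interp.run, "x = 40 + 2")
        fut.result()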
On Mon, 2 Oct 2017 21:31:30 -0400 Eric Snow <ericsnowcurrently@gmail.com> wrote:
By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
There shouldn't be a need to synchronize on INCREF. If both interpreters have at least 1 reference then either one adding a reference shouldn't be a problem.
I'm not sure what Nick meant by "synchronization point", but at least you certainly need INCREF and DECREF to be atomic, which is a departure from today's Py_INCREF / Py_DECREF behaviour (and is significantly slower, even on high-level benchmarks). Regards Antoine.
On Wed, 4 Oct 2017 17:50:33 +0200 Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm not sure what Nick meant by "synchronization point", but at least you certainly need INCREF and DECREF to be atomic, which is a departure from today's Py_INCREF / Py_DECREF behaviour (and is significantly slower, even on high-level benchmarks).
To be clear, I'm writing this under the hypothesis of per-interpreter GILs. I'm not really interested in the per-process GIL case :-) Regards Antoine.
On 14 September 2017 at 11:44, Eric Snow <ericsnowcurrently@gmail.com> wrote:
About Subinterpreters =====================
Shared data -----------
[snip]
To make this work, the mutable shared state will be managed by the Python runtime, not by any of the interpreters. Initially we will support only one type of objects for shared state: the channels provided by ``create_channel()``. Channels, in turn, will carefully manage passing objects between interpreters.
Something I think you may want to explicitly call out as *not* being shared is the thread objects in threading.enumerate(), as the way that works in the current implementation makes sense, but isn't particularly obvious (what I have below comes from experimenting with your branch at https://github.com/python/cpython/pull/1748).

Specifically, what happens is that the operating system thread underlying the existing interpreter thread that calls interp.run() gets borrowed as the operating system thread underlying the MainThread object in the called interpreter. That MainThread object then gets preserved in the interpreter's interpreter state, but the mapping to an underlying OS thread will change freely based on who's calling into it. From outside an interpreter, you *can't* request to run code in subthreads directly - you'll always run your given code in the main thread, and it will be up to that code to dispatch requests to subthreads. Beyond the thread lending that happens when you call interp.run() (where one of your threads gets borrowed as the other interpreter's main thread), each interpreter otherwise maintains a completely disjoint set of thread objects that it is solely responsible for.

This also clarifies for me what it means for an interpreter to be a "main" interpreter: it's the interpreter whose main thread actually corresponds to the main thread of the overall operating system process, rather than being temporarily borrowed from another interpreter.

We're going to have to put some thought into how we want that to interact with the signal handling logic - right now, I believe *any* main thread will consider it its responsibility to process signals delivered to the runtime (and embedding applications avoid the potential problems arising from that by simply not installing the CPython signal handlers in the first place), and we probably want to change that condition to be "the main thread in the main interpreter". Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
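A sketch of the borrowing behaviour described above, again using the proposed API (the output reflects the PR branch's observed behaviour, not a documented guarantee)::

    import threading
    import interpreters  # the proposed module

    interp = interpreters.create()

    def show_thread():
        # This worker thread is borrowed as the target interpreter's
        # main thread for the duration of the call.
        interp.run("import threading; print(threading.current_thread())")

    t = threading.Thread(target=show_thread)
    t.start()
    t.join()  # prints something like: <_MainThread(MainThread, started ...)>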
Hi,

First my high-level opinion about the PEP: the CSP model can probably already be implemented using Queues. To me, the interesting promise of subinterpreters is whether they will allow removing the GIL while sharing memory for big objects (such as Numpy arrays). This means the PEP should probably focus on potential concurrency improvements rather than try to faithfully follow the CSP model.

Other than that, a bunch of detailed comments follow: On Wed, 13 Sep 2017 18:44:31 -0700 Eric Snow <ericsnowcurrently@gmail.com> wrote:
API for interpreters --------------------
The module provides the following functions:
``list_all()``::
Return a list of all existing interpreters.
See my naming proposal in the previous thread.
run(source_str, /, **shared):
Run the provided Python source code in the interpreter. Any keyword arguments are added to the interpreter's execution namespace.
"Execution namespace" specifically means the __main__ module in the target interpreter, right?
If any of the values are not supported for sharing between interpreters then RuntimeError gets raised. Currently only channels (see "create_channel()" below) are supported.
This may not be called on an already running interpreter. Doing so results in a RuntimeError.
I would distinguish between the two error cases: RuntimeError for calling run() on an already running interpreter, ValueError for values which are not supported for sharing.
Likewise, if there is any uncaught exception, it propagates into the code where "run()" was called.
That makes it a bit harder to differentiate from errors raised by run() itself (see above), though how much of an annoyance this is remains unclear. The more contentious implication is that it forces the interpreter to support migration of arbitrary objects from one interpreter to another (since a traceback keeps all local variables alive).
API for sharing data --------------------
The mechanism for passing objects between interpreters is through channels. A channel is a simplex FIFO similar to a pipe. The main difference is that channels can be associated with zero or more interpreters on either end.
So it seems channels have become more complicated now? Is it important to support multi-producer multi-consumer channels?
Unlike queues, which are also many-to-many, channels have no buffer.
How does it work? Does send() block until someone else calls recv()? That does not sound like a good idea to me. I don't think it's a coincidence that the most varied kinds of I/O (from socket or file IO to threading Queues to multiprocessing Pipes) have non-blocking send(). send() blocking until someone else calls recv() is not only bad for performance, it also increases the likelihood of deadlocks.
recv_nowait(default=None):
Return the next object from the channel. If none have been sent then return the default. If the channel has been closed then EOFError is raised.
close():
No longer associate the current interpreter with the channel (on the receiving end). This is a noop if the interpreter isn't already associated. Once an interpreter is no longer associated with the channel, subsequent (or current) send() and recv() calls from that interpreter will raise EOFError.
EOFError normally means the *other* (sending) side has closed the channel (but it becomes complicated with a multi-producer multi-consumer setup...). When *this* side has closed the channel, we should raise ValueError.
The Python runtime will garbage collect all closed channels. Note that "close()" is automatically called when it is no longer used in the current interpreter.
"No longer used" meaning it loses all references in this interpreter?
send(obj):
Send the object to the receiving end of the channel. Wait until the object is received. If the channel does not support the object then TypeError is raised. Currently only bytes are supported. If the channel has been closed then EOFError is raised.
Similar remark as above (EOFError vs. ValueError). More generally, send() raising EOFError sounds unheard of. A sidenote: context manager support (__enter__ / __exit__) on channels would sound more useful to me than iteration support.
Initial support for buffers in channels ---------------------------------------
An alternative to support for bytes in channels is support for read-only buffers (the PEP 3119 kind).
Probably you mean PEP 3118.
Then ``recv()`` would return a memoryview to expose the buffer in a zero-copy way.
It will probably not do much if you can only pass buffers and not structured objects, because unserializing (e.g. unpickling) from a buffer will still copy memory around. To pass a Numpy array, for example, you not only need to pass its contents but also its metadata (its value type -- named "dtype" -- its shape and strides). This may be serialized as simple tuples of atomic types (str, int, bytes, other tuples), but you want to include a memoryview of the data area somewhere in those tuples. (and, of course, at some point, this will feel like reinventing pickle :) but pickle has no mechanism to avoid memory copies, so it can't readily be reused here -- otherwise you're just reinventing multiprocessing...)
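For concreteness, a rough sketch of what passing a NumPy array might look like, assuming (beyond the current draft) that channels could carry tuples of atomic values as well as buffers, and restricting the example to C-contiguous arrays::

    import numpy as np

    def send_array(send, arr):
        arr = np.ascontiguousarray(arr)
        send.send((arr.dtype.str, arr.shape))  # metadata: small atomic values
        send.send(arr.data)                    # data area: shared as a buffer

    def recv_array(recv):
        dtype, shape = recv.recv()
        buf = recv.recv()  # ideally a zero-copy memoryview over the data
        return np.frombuffer(buf, dtype=dtype).reshape(shape)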
timeout arg to pop() and push() -------------------------------
pop() and push() don't exist anymore :-)
Synchronization Primitives --------------------------
The ``threading`` module provides a number of synchronization primitives for coordinating concurrent operations. This is especially necessary due to the shared-state nature of threading. In contrast, subinterpreters do not share state. Data sharing is restricted to channels, which do away with the need for explicit synchronization.
I think this rationale confuses Python-level data sharing with process-level data sharing. The main point of subinterpreters (compared to multiprocessing) is that they live in the same OS process. So it's really not true that you can't share a low-level synchronization primitive (say a semaphore) between subinterpreters. (also see multiprocessing/synchronize.py, which implements all synchronization primitives using basic low-level semaphores)
Solutions include:
* a ``create()`` arg to indicate resetting ``__main__`` after each ``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired
This would all be a false promise. Persistent state lives in other places than __main__ (for example the loaded modules and their respective configurations - think logging or decimal).
Use queues instead of channels ------------------------------
The main difference between queues and channels is that queues support buffering. This would complicate the blocking semantics of ``recv()`` and ``send()``. Also, queues can be built on top of channels.
But buffering with background threads in pure Python will be orders of magnitude slower than optimized buffering in a custom low-level implementation. It would be a pity if a subinterpreters Queue ended up as slow as a multiprocessing Queue. Regards Antoine.
Thanks for the feedback, Antoine. Sorry for the delay; it's been a busy week for me. I just pushed an updated PEP to the repo. Once I've sorted out the question of passing bytes through channels I plan on posting the PEP to the list again for another round of discussion. In the meantime, I've replied below in-line. -eric On Mon, Sep 18, 2017 at 4:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
First my high-level opinion about the PEP: the CSP model can probably already be implemented using Queues. To me, the interesting promise of subinterpreters is whether they will allow removing the GIL while sharing memory for big objects (such as Numpy arrays). This means the PEP should probably focus on potential concurrency improvements rather than try to faithfully follow the CSP model.
Please elaborate. I'm interested in understanding what you mean here. Do you have some subinterpreter-based concurrency improvements in mind? What aspect of CSP is the PEP following too faithfully?
``list_all()``::
Return a list of all existing interpreters.
See my naming proposal in the previous thread.
Sorry, your previous comment slipped through the cracks. You suggested:

    As for the naming, let's make it both unconfusing and explicit? How about three functions: `all_interpreters()`, `running_interpreters()` and `idle_interpreters()`, for example?

As to "all_interpreters()", I suppose it's the difference between "interpreters.all_interpreters()" and "interpreters.list_all()". To me the latter looks better. As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension::

    [interp for interp in interpreters.list_all() if interp.is_running()]
    [interp for interp in interpreters.list_all() if not interp.is_running()]
run(source_str, /, **shared):
Run the provided Python source code in the interpreter. Any keyword arguments are added to the interpreter's execution namespace.
"Execution namespace" specifically means the __main__ module in the target interpreter, right?
Right. It's explained in more detail a little further down and elsewhere in the PEP. I've updated the PEP to explicitly mention __main__ here too.
If any of the values are not supported for sharing between interpreters then RuntimeError gets raised. Currently only channels (see "create_channel()" below) are supported.
This may not be called on an already running interpreter. Doing so results in a RuntimeError.
I would distinguish between both error cases: RuntimeError for calling run() on an already running interpreter, ValueError for values which are not supported for sharing.
Good point.
Likewise, if there is any uncaught exception, it propagates into the code where "run()" was called.
That makes it a bit harder to differentiate from errors raised by run() itself (see above), though how much of an annoyance this is remains unclear. The more contentious implication is that it forces the interpreter to support migration of arbitrary objects from one interpreter to another (since a traceback keeps all local variables alive).
Yeah, the proposal to propagate exceptions out of the subinterpreter is still rather weak. I've added some notes to the PEP about this open issue.
The mechanism for passing objects between interpreters is through channels. A channel is a simplex FIFO similar to a pipe. The main difference is that channels can be associated with zero or more interpreters on either end.
So it seems channels have become more complicated now? Is it important to support multi-producer multi-consumer channels?
To me it made the API simpler. The change did introduce the "close()" method, which I suppose could be confusing. However, I'm sure that in practice it won't be. In contrast, the FIFO/pipe-based API that I had before required passing names around, required more calls, required managing the channel/interpreter relationship more carefully, and made it hard to follow that relationship.
Unlike queues, which are also many-to-many, channels have no buffer.
How does it work? Does send() block until someone else calls recv()? That does not sound like a good idea to me.
Correct "send()" blocks until the other end receives (if ever). Likewise "recv()" blocks until the other end sends. This specific behavior is probably the main thing I borrowed from CSP. It is *the* synchronization mechanism. Given the isolated nature of subinterpreters, I consider using this concept from CSP to be a good fit.
I don't think it's a coincidence that the most varied kinds of I/O (from socket or file IO to threading Queues to multiprocessing Pipes) have non-blocking send().
Interestingly, you can set sockets to blocking mode, in which case send() will block until there is room in the kernel buffer. Likewise, queue.Queue.put() supports blocking, in addition to providing a put_nowait() method. Note that the PEP provides "recv_nowait()" and "send_nowait()" (names inspired by queue.Queue), allowing for a non-blocking send. It's just not the default. I deliberated for a little while on which one to make the default. In the end I went with blocking-by-default to stick to the CSP model. However, I want to do what's most practical for users. I can imagine folks at first not expecting blocking send by default. However, it otherwise isn't clear yet which one is better for interpreter channels. I'll add an "open question" about switching to non-blocking-by-default for send().
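To illustrate the difference with the proposed API (hypothetical usage, since the module does not exist yet)::

    import threading
    import interpreters  # the proposed module

    recv, send = interpreters.create_channel()

    t = threading.Thread(target=lambda: print(recv.recv()))
    t.start()
    send.send(b'ping')  # blocking by default: waits for the recv() above
    t.join()

    # The non-blocking variants never wait:
    #   send.send_nowait(b'x')          # fails if no receiver is ready
    #   recv.recv_nowait(default=None)  # returns default if nothing was sent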
send() blocking until someone else calls recv() is not only bad for performance,
What is the performance problem?
it also increases the likelihood of deadlocks.
How much of a problem will deadlocks be in practice? (FWIW, CSP provides rigorous guarantees about deadlock detection (which Go leverages), though I'm not sure how much benefit that can offer such a dynamic language as Python.) Regardless, I'll make sure the PEP discusses deadlocks.
EOFError normally means the *other* (sending) side has closed the channel (but it becomes complicated with a multi-producer multi-consumer setup...). When *this* side has closed the channel, we should raise ValueError.
I've fixed this in the PEP.
The Python runtime will garbage collect all closed channels. Note that "close()" is automatically called when it is no longer used in the current interpreter.
"No longer used" meaning it loses all references in this interpreter?
Correct. I've clarified this in the PEP.
Similar remark as above (EOFError vs. ValueError). More generally, send() raising EOFError sounds unheard of.
Hmm. I've fixed this in the PEP, but perhaps using EOFError here (and even for recv()) isn't right. I was drawing inspiration from pipes, but certainly the semantics aren't exactly the same. So it may make sense to use something else less I/O-related, like a new exception type in the "interpreters" module. I'll make a note in the PEP about this.
A sidenote: context manager support (__enter__ / __exit__) on channels would sound more useful to me than iteration support.
Yeah, I can see that. FWIW, I've dropped __next__() from the PEP. I've also added a note about added context manager support.
An alternative to support for bytes in channels is support for read-only buffers (the PEP 3119 kind).
Probably you mean PEP 3118.
Yep. :)
Then ``recv()`` would return a memoryview to expose the buffer in a zero-copy way.
It will probably not do much if you can only pass buffers and not structured objects, because unserializing (e.g. unpickling) from a buffer will still copy memory around.
To pass a Numpy array, for example, you not only need to pass its contents but also its metadata (its value type -- named "dtype" --, its shape and strides). This may be serialized as simple tuples of atomic types (str, int, bytes, other tuples), but you want to include a memoryview of the data area somewhere in those tuples.
(and, of course, at some point, this will feel like reinventing pickle :)) but pickle has no mechanism to avoid memory copies, so it can't readily be reused here -- otherwise you're just reinventing multiprocessing...)
I'm still working through all the passing-buffers-through-channels feedback, so I'll defer on a reply for now. :)
timeout arg to pop() and push() -------------------------------
pop() and push() don't exist anymore :-)
Fixed! :)
Synchronization Primitives --------------------------
The ``threading`` module provides a number of synchronization primitives for coordinating concurrent operations. This is especially necessary due to the shared-state nature of threading. In contrast, subinterpreters do not share state. Data sharing is restricted to channels, which do away with the need for explicit synchronization.
I think this rationale confuses Python-level data sharing with process-level data sharing. The main point of subinterpreters (compared to multiprocessing) is that they live in the same OS process. So it's really not true that you can't share a low-level synchronization primitive (say a semaphore) between subinterpreters.
I'm not sure I understand your concern here. Perhaps I used the word "sharing" too ambiguously? By "sharing" I mean that the two actors have read access to something that at least one of them can modify. If they both only have read-only access then it's effectively the same as if they are not sharing. While I can imagine the *possibility* (some day) of an opt-in mechanism to share objects (r/rw or rw/rw), that is definitely not a part of this PEP. I expect that in reality we will only ever pass immutable data between interpreters. So I'm unclear on what need there might be for any synchronization primitives other than what is inherent to channels.
* a ``create()`` arg to indicate resetting ``__main__`` after each ``run`` call
* an ``Interpreter.reset_main`` flag to support opting in or out after the fact
* an ``Interpreter.reset_main()`` method to opt in when desired
This would all be a false promise. Persistent state lives in other places than __main__ (for example the loaded modules and their respective configurations - think logging or decimal).
I've added a bit more explanation to the PEP to clarify this point.
The main difference between queues and channels is that queues support buffering. This would complicate the blocking semantics of ``recv()`` and ``send()``. Also, queues can be built on top of channels.
But buffering with background threads in pure Python will be orders of magnitude slower than optimized buffering in a custom low-level implementation. It would be a pity if a subinterpreters Queue ended up as slow as a multiprocessing Queue.
I agree. I'm entirely open to supporting other object-passing types, including adding low-level implementations. I've added a note to the PEP to that effect. However, I wanted to start off with the most basic object-passing type, and I felt that channels provides the simplest solution. My goal is to get a basic API landed in 3.7 and then build on it from there for 3.8.

That said, in the interest of enabling extra utility in the near-term, I expect that we will be able to design the PyInterpreterState changes (few as they are) in such a way that a C-extension could implement an efficient multi-interpreter Queue type that would run under 3.7. Actually, would that be strictly necessary if you can interact with channels without the GIL in the C-API? Regardless, I'll make a note in the PEP about the relationship between the C-API and implementing an efficient multi-interpreter Queue. I suppose that means I need to add C-API changes to the PEP (which I had wanted to avoid).
Hi Eric, On Fri, 22 Sep 2017 19:09:01 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
Please elaborate. I'm interested in understanding what you mean here. Do you have some subinterpreter-based concurrency improvements in mind? What aspect of CSP is the PEP following too faithfully?
See below the discussion of blocking send()s :-)
As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension:
[interp for interp in interpreters.list_all() if interp.is_running()]
[interp for interp in interpreters.list_all() if not interp.is_running()]
There is an inherent race condition in doing that, at least if interpreters are running in multiple threads (which I assume is going to be the overwhelmingly dominant usage model). That is why I'm proposing all three variants.
I don't think it's a coincidence that the most varied kinds of I/O (from socket or file IO to threading Queues to multiprocessing Pipes) have non-blocking send().
Interestingly, you can set sockets to blocking mode, in which case send() will block until there is room in the kernel buffer.
Yes, but there *is* a kernel buffer. Which is the whole point of my comment: most similar primitives have internal buffering to prevent the user-facing send() API from blocking in the common case.
Likewise, queue.Queue.put() supports blocking, in addition to providing a put_nowait() method.
queue.Queue.put() never blocks in the usual case (*), which is that of an unbounded queue. Only bounded queues (created with an explicit non-zero max_size parameter) can block in Queue.put(). (*) and therefore also never deadlocks :-)
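The stdlib behaviour being described here is easy to demonstrate today::

    import queue

    q = queue.Queue()  # unbounded: put() always succeeds immediately
    q.put(1)
    q.put(2)

    b = queue.Queue(maxsize=1)  # bounded: put() can block (or fail fast)
    b.put(1)
    try:
        b.put_nowait(2)
    except queue.Full:
        print("bounded queue is full")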
Note that the PEP provides "recv_nowait()" and "send_nowait()" (names inspired by queue.Queue), allowing for a non-blocking send.
True, but it's not the same thing at all. In the objects I mentioned, send() mostly doesn't block and doesn't fail either. In your model, send_nowait() will routinely fail with an error if a recipient isn't immediately available to recv the data.
send() blocking until someone else calls recv() is not only bad for performance,
What is the performance problem?
Intuitively, there must be some kind of context switch (interpreter switch?) at each send() call to let the other end receive the data, since you don't have any internal buffering. Also, suddenly an interpreter's ability to exploit CPU time is dependent on another interpreter's ability to consume data in a timely manner (what if the other interpreter is e.g. stuck on some disk I/O?). IMHO it would be better not to have such coupling.
it also increases the likelihood of deadlocks.
How much of a problem will deadlocks be in practice?
I expect more often than expected, in complex systems :-) For example, you could have a recv() loop that also from time to time send()s some data on another queue, depending on what is received. But if that send()'s recipient also has the same structure (a recv() loop which send()s from time to time), then it's easy to imagine the two getting into a deadlock.
(FWIW, CSP provides rigorous guarantees about deadlock detection (which Go leverages), though I'm not sure how much benefit that can offer such a dynamic language as Python.)
Hmm... deadlock detection is one thing, but when detected you must still solve those deadlock issues, right?
I'm not sure I understand your concern here. Perhaps I used the word "sharing" too ambiguously? By "sharing" I mean that the two actors have read access to something that at least one of them can modify. If they both only have read-only access then it's effectively the same as if they are not sharing.
Right. What I mean is that you *can* share very simple "data" under the form of synchronization primitives. You may want to synchronize your interpreters even if they don't share user-visible memory areas. The point of synchronization is not only to avoid memory corruption but also to regulate and orchestrate processing amongst multiple workers (for example processes or interpreters). For example, a semaphore is an easy way to implement "I want no more than N workers to do this thing at the same time" ("this thing" can be something such as disk I/O). Regards Antoine.
On 2017-09-23 10:45, Antoine Pitrou wrote:
Hi Eric,
On Fri, 22 Sep 2017 19:09:01 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
Please elaborate. I'm interested in understanding what you mean here. Do you have some subinterpreter-based concurrency improvements in mind? What aspect of CSP is the PEP following too faithfully?
See below the discussion of blocking send()s :-)
As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension:
[interp for interp in interpreters.list_all() if interp.is_running()]
[interp for interp in interpreters.list_all() if not interp.is_running()]
There is an inherent race condition in doing that, at least if interpreters are running in multiple threads (which I assume is going to be the overwhelmingly dominant usage model). That is why I'm proposing all three variants.
An alternative to 3 variants would be::

    interpreters.list_all(running=True)
    interpreters.list_all(running=False)
    interpreters.list_all(running=None)

[snip]
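That keyword form could be a thin wrapper over the PEP's list_all(); a sketch (which, note, still has the race discussed above, since the filtering happens in Python)::

    import interpreters  # the proposed module

    def list_all(running=None):
        interps = interpreters.list_all()
        if running is None:
            return interps
        return [i for i in interps if i.is_running() == running]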
On Sat, Sep 23, 2017 at 2:45 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension:
[interp for interp in interpreters.list_all() if interp.is_running()]
[interp for interp in interpreters.list_all() if not interp.is_running()]
There is an inherent race condition in doing that, at least if interpreters are running in multiple threads (which I assume is going to be the overwhelmingly dominant usage model). That is why I'm proposing all three variants.
There's a race condition no matter what the API looks like -- having a dedicated running_interpreters() lets you guarantee that the returned list describes the set of interpreters that were running at some moment in time, but you don't know when that moment was and by the time you get the list, it's already out-of-date. So this doesn't seem very useful. OTOH if we think that invariants like this are useful, we might also want to guarantee that calling running_interpreters() and idle_interpreters() gives two lists such that each interpreter appears in exactly one of them, but that's impossible with this API; it'd require a single function that returns both lists. What problem are you trying to solve?
Likewise, queue.Queue.put() supports blocking, in addition to providing a put_nowait() method.
queue.Queue.put() never blocks in the usual case (*), which is that of an unbounded queue. Only bounded queues (created with an explicit non-zero max_size parameter) can block in Queue.put().
(*) and therefore also never deadlocks :-)
Unbounded queues also introduce unbounded latency and memory usage in realistic situations. (E.g. a producer/consumer setup where the producer runs faster than the consumer.) There's a reason why sockets always have bounded buffers -- it's sometimes painful, but the pain is intrinsic to building distributed systems, and unbounded buffers just paper over it.
send() blocking until someone else calls recv() is not only bad for performance,
What is the performance problem?
Intuitively, there must be some kind of context switch (interpreter switch?) at each send() call to let the other end receive the data, since you don't have any internal buffering.
Technically you just need the other end to wake up at some time in between any two calls to send(), and if there's no GIL then this doesn't necessarily require a context switch.
Also, suddenly an interpreter's ability to exploit CPU time is dependent on another interpreter's ability to consume data in a timely manner (what if the other interpreter is e.g. stuck on some disk I/O?). IMHO it would be better not to have such coupling.
A small buffer probably is useful in some cases, yeah -- basically enough to smooth out scheduler jitter.
it also increases the likelihood of deadlocks.
How much of a problem will deadlocks be in practice?
I expect more often than expected, in complex systems :-) For example, you could have a recv() loop that also from time to time send()s some data on another queue, depending on what is received. But if that send()'s recipient also has the same structure (a recv() loop which send()s from time to time), then it's easy to imagine the two getting into a deadlock.
You kind of want to be able to create deadlocks, since the alternative is processes that can't coordinate and end up stuck in livelocks or with unbounded memory use etc.
I'm not sure I understand your concern here. Perhaps I used the word "sharing" too ambiguously? By "sharing" I mean that the two actors have read access to something that at least one of them can modify. If they both only have read-only access then it's effectively the same as if they are not sharing.
Right. What I mean is that you *can* share very simple "data" under the form of synchronization primitives. You may want to synchronize your interpreters even if they don't share user-visible memory areas. The point of synchronization is not only to avoid memory corruption but also to regulate and orchestrate processing amongst multiple workers (for example processes or interpreters). For example, a semaphore is an easy way to implement "I want no more than N workers to do this thing at the same time" ("this thing" can be something such as disk I/O).
It's fairly reasonable to implement a mutex using a CSP-style unbuffered channel (send = acquire, receive = release). And the same trick turns a channel with a fixed-size buffer into a bounded semaphore. It won't be as efficient as a modern specialized mutex implementation, of course, but it's workable. Unfortunately while technically you can construct a buffered channel out of an unbuffered channel, the construction's pretty unreasonable (it needs two dedicated threads per channel). -n -- Nathaniel J. Smith -- https://vorpus.org
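The mutex/semaphore trick can be demonstrated today with queue.Queue standing in for a fixed-size buffered channel: putting a token acquires, taking it back releases, and maxsize bounds the number of concurrent holders::

    import queue

    class ChannelSemaphore:
        """A bounded semaphore built on a size-n 'channel' stand-in."""

        def __init__(self, n=1):
            self._q = queue.Queue(maxsize=n)  # n == 1 gives a mutex

        def acquire(self):
            self._q.put(None)  # blocks while all n tokens are in flight

        def release(self):
            self._q.get_nowait()  # raises queue.Empty if not acquired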
On Mon, 25 Sep 2017 17:42:02 -0700 Nathaniel Smith <njs@pobox.com> wrote:
On Sat, Sep 23, 2017 at 2:45 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension:
[interp for interp in interpreters.list_all() if interp.is_running()]
[interp for interp in interpreters.list_all() if not interp.is_running()]
There is an inherent race condition in doing that, at least if interpreters are running in multiple threads (which I assume is going to be the overwhelmingly dominant usage model). That is why I'm proposing all three variants.
There's a race condition no matter what the API looks like -- having a dedicated running_interpreters() lets you guarantee that the returned list describes the set of interpreters that were running at some moment in time, but you don't know when that moment was and by the time you get the list, it's already out-of-date.
Hmm, you're right of course.
Likewise, queue.Queue.put() supports blocking, in addition to providing a put_nowait() method.
queue.Queue.put() never blocks in the usual case (*), which is that of an unbounded queue. Only bounded queues (created with an explicit non-zero max_size parameter) can block in Queue.put().
(*) and therefore also never deadlocks :-)
Unbounded queues also introduce unbounded latency and memory usage in realistic situations.
This doesn't seem to pose much of a problem in common use cases, though. How many Python programs have you seen switch from an unbounded to a bounded Queue to solve this problem? Conversely, choosing a buffer size is tricky. How do you know up front which amount you need? Is a fixed buffer size even ok or do you want it to fluctuate based on the current conditions? And regardless, my point was that a buffer is desirable. That send() may block when the buffer is full doesn't change that it won't block in the common case.
There's a reason why sockets always have bounded buffers -- it's sometimes painful, but the pain is intrinsic to building distributed systems, and unbounded buffers just paper over it.
Papering over a problem is sometimes the right answer actually :-) For example, most Python programs assume memory is unbounded... If I'm using a queue or channel to push events to a logging system, should I really block at every send() call? Most probably I'd rather run ahead instead.
Also, suddenly an interpreter's ability to exploit CPU time is dependent on another interpreter's ability to consume data in a timely manner (what if the other interpreter is e.g. stuck on some disk I/O?). IMHO it would be better not to have such coupling.
A small buffer probably is useful in some cases, yeah -- basically enough to smooth out scheduler jitter.
That's not about scheduler jitter, but catering for activities which occur at inherently different speeds or rhythms. Requiring things to run in lockstep removes a lot of flexibility and makes it harder to exploit CPU resources fully.
I expect more often than expected, in complex systems :-) For example, you could have a recv() loop that also from time to time send()s some data on another queue, depending on what is received. But if that send()'s recipient also has the same structure (a recv() loop which send()s from time to time), then it's easy to imagine the two getting into a deadlock.
You kind of want to be able to create deadlocks, since the alternative is processes that can't coordinate and end up stuck in livelocks or with unbounded memory use etc.
I am not advocating we make it *impossible* to create deadlocks; just saying we should not make them more *likely* than they need to be.
I'm not sure I understand your concern here. Perhaps I used the word "sharing" too ambiguously? By "sharing" I mean that the two actors have read access to something that at least one of them can modify. If they both only have read-only access then it's effectively the same as if they are not sharing.
Right. What I mean is that you *can* share very simple "data" under the form of synchronization primitives. You may want to synchronize your interpreters even if they don't share user-visible memory areas. The point of synchronization is not only to avoid memory corruption but also to regulate and orchestrate processing amongst multiple workers (for example processes or interpreters). For example, a semaphore is an easy way to implement "I want no more than N workers to do this thing at the same time" ("this thing" can be something such as disk I/O).
It's fairly reasonable to implement a mutex using a CSP-style unbuffered channel (send = acquire, receive = release). And the same trick turns a channel with a fixed-size buffer into a bounded semaphore. It won't be as efficient as a modern specialized mutex implementation, of course, but it's workable.
We are drifting away from the point I was trying to make here. I was pointing out that the claim that nothing can be shared is a lie. If it's possible to share a small datum (a synchronized counter aka semaphore) between processes, certainly there's no technical reason that should prevent it between interpreters. By the way, I do think efficiency is a concern here. Otherwise subinterpreters don't even have a point (just use multiprocessing).
Unfortunately while technically you can construct a buffered channel out of an unbuffered channel, the construction's pretty unreasonable (it needs two dedicated threads per channel).
And the reverse is quite cumbersome as well. So we should favour the construct that's more convenient for users, or provide both. Regards Antoine.
On 26 September 2017 at 17:04, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 25 Sep 2017 17:42:02 -0700 Nathaniel Smith <njs@pobox.com> wrote:
Unbounded queues also introduce unbounded latency and memory usage in realistic situations.
This doesn't seem to pose much of a problem in common use cases, though. How many Python programs have you seen switch from an unbounded to a bounded Queue to solve this problem?
Conversely, choosing a buffer size is tricky. How do you know up front which amount you need? Is a fixed buffer size even ok or do you want it to fluctuate based on the current conditions?
And regardless, my point was that a buffer is desirable. That send() may block when the buffer is full doesn't change that it won't block in the common case.
It's also the case that unlike Go channels, which were designed from scratch on the basis of implementing pure CSP, Python has an established behavioural precedent in the APIs of queue.Queue and collections.deque: they're unbounded by default, and you have to opt in to making them bounded.
There's a reason why sockets always have bounded buffers -- it's sometimes painful, but the pain is intrinsic to building distributed systems, and unbounded buffers just paper over it.
Papering over a problem is sometimes the right answer actually :-) For example, most Python programs assume memory is unbounded...
If I'm using a queue or channel to push events to a logging system, should I really block at every send() call? Most probably I'd rather run ahead instead.
While the article title is clickbaity, http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-fee... actually has a good discussion of this point. Search for "compose" to find the relevant section ("Channels don’t compose well with other concurrency primitives"). The specific problem cited is that only offering unbuffered or bounded-buffer channels means that every send call becomes a potential deadlock scenario, as all that needs to happen is for you to be holding a different synchronisation primitive when the send call blocks.
Also, suddenly an interpreter's ability to exploit CPU time is dependent on another interpreter's ability to consume data in a timely manner (what if the other interpreter is e.g. stuck on some disk I/O?). IMHO it would be better not to have such coupling.
A small buffer probably is useful in some cases, yeah -- basically enough to smooth out scheduler jitter.
That's not about scheduler jitter, but catering for activities which occur at inherently different speeds or rhythms. Requiring things to run in lockstep removes a lot of flexibility and makes it harder to exploit CPU resources fully.
The fact that the proposal now allows for M:N sender:receiver relationships (just as queue.Queue does with threads) makes that problem worse, since you may now have variability not only on the message consumption side, but also on the message production side.

Consider this example where you have an event processing thread pool that we're attempting to isolate from blocking IO by using channels rather than coroutines. Desired flow (a queue-based sketch of this flow appears after this message):

1. Listener thread receives external message from socket
2. Listener thread files message for processing on receive channel
3. Listener thread returns to blocking on the receive socket

4. Processing thread picks up message from receive channel
5. Processing thread processes message
6. Processing thread puts reply on the send channel

7. Sending thread picks up message from send channel
8. Sending thread makes a blocking network send call to transmit the message
9. Sending thread returns to blocking on the send channel

When queue.Queue is used to pass the messages between threads, such an arrangement will be effectively non-blocking as long as the send rate is greater than or equal to the receive rate. However, the GIL means it won't exploit all available cores, even if we create multiple processing threads: you have to switch to multiprocessing for that, with all the extra overhead that entails.

So I see the essential premise of PEP 554 as being to ask the question "If each of these threads was running its own *interpreter*, could we use Sans IO style protocols with interpreter channels to separate internally "synchronous" processing threads from separate IO threads operating at system boundaries, without having to make the entire application pervasively asynchronous?"

If channels are an unbuffered blocking primitive, then we don't get that benefit: even when there are additional receive messages to be processed, the processing thread will block until the previous send has completed. Switching the listener and sender threads over to asynchronous IO would help with that, but they'd also end up having to implement their own message buffering to manage the lack of buffering in the core channel primitive.

By contrast, if the core channels are designed to offer an unbounded buffer by default, then you can get close-to-CSP semantics just by setting the buffer size to 1 (it's still not exactly CSP, since that has a buffer size of 0, but you at least get the semantics of having to alternate sending and receiving of messages).
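A compact stand-in for the flow sketched in that message, with queue.Queue in place of the proposed channels and print() in place of the real socket IO::

    import queue
    import threading

    inbox = queue.Queue()   # listener -> processor (the "receive channel")
    outbox = queue.Queue()  # processor -> sender (the "send channel")

    def processor():
        while True:
            msg = inbox.get()        # steps 4-6: pick up, process, reply
            if msg is None:
                outbox.put(None)
                break
            outbox.put(msg.upper())  # placeholder processing

    def sender():
        while True:
            reply = outbox.get()     # steps 7-9: transmit the reply
            if reply is None:
                break
            print("send:", reply)    # placeholder for the blocking send call

    threading.Thread(target=processor).start()
    threading.Thread(target=sender).start()

    for msg in ("a", "b"):           # steps 1-3: the listener files messages
        inbox.put(msg)
    inbox.put(None)                  # shut the pipeline down

With unbuffered blocking channels in place of the queues, each inbox.put() would instead stall until the processor had fully handled the previous message.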
I expect more often than expected, in complex systems :-) For example, you could have a recv() loop that also from time to time send()s some data on another queue, depending on what is received. But if that send()'s recipient also has the same structure (a recv() loop which send()s from time to time), then it's easy to imagine the two getting into a deadlock.
You kind of want to be able to create deadlocks, since the alternative is processes that can't coordinate and end up stuck in livelocks or with unbounded memory use etc.
I am not advocating we make it *impossible* to create deadlocks; just saying we should not make them more *likely* than they need to be.
Right, and I think the queue.Queue and collections.deque model works well for that, since you can start introducing queue bounds to propagate backpressure through a system if you're seeing undesirable memory growth.
It's fairly reasonable to implement a mutex using a CSP-style unbuffered channel (send = acquire, receive = release). And the same trick turns a channel with a fixed-size buffer into a bounded semaphore. It won't be as efficient as a modern specialized mutex implementation, of course, but it's workable.
We are drifting away from the point I was trying to make here. I was pointing out that the claim that nothing can be shared is a lie. If it's possible to share a small datum (a synchronized counter aka semaphore) between processes, certainly there's no technical reason that should prevent it between interpreters.
By the way, I do think efficiency is a concern here. Otherwise subinterpreters don't even have a point (just use multiprocessing).
Agreed, and I think the interaction between the threading module and the interpreters module is one we're going to have to explicitly call out as being covered by the provisional status of the interpreters module, as I think it could be incredibly valuable to be able to send at least some threading objects through channels, and have them be an interpreter-specific reference to a common underlying sync primitive.
Unfortunately while technically you can construct a buffered channel out of an unbuffered channel, the construction's pretty unreasonable (it needs two dedicated threads per channel).
And the reverse is quite cumbersome as well. So we should favour the construct that's more convenient for users, or provide both.
As noted above, I think consistency with design intuitions formed through the use of queue.Queue is also an important consideration. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Wed, Sep 27, 2017 at 1:26 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It's also the case that unlike Go channels, which were designed from scratch on the basis of implementing pure CSP,
FWIW, Go's channels (and goroutines) don't implement pure CSP. They provide a variant that the Go authors felt was more in-line with the language's flavor. The channels in the PEP aim to support a more pure implementation.
Python has an established behavioural precedent in the APIs of queue.Queue and collections.deque: they're unbounded by default, and you have to opt in to making them bounded.
Right. That's part of why I'm leaning toward support for buffered channels.
While the article title is clickbaity, http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-fee... actually has a good discussion of this point. Search for "compose" to find the relevant section ("Channels don’t compose well with other concurrency primitives").
The specific problem cited is that only offering unbuffered or bounded-buffer channels means that every send call becomes a potential deadlock scenario, as all that needs to happen is for you to be holding a different synchronisation primitive when the send call blocks.
Yeah, that blog post was a reference for me as I was designing the PEP's channels.
The fact that the proposal now allows for M:N sender:receiver relationships (just as queue.Queue does with threads) makes that problem worse, since you may now have variability not only on the message consumption side, but also on the message production side.
Consider this example where you have an event processing thread pool that we're attempting to isolate from blocking IO by using channels rather than coroutines.
Desired flow:
1. Listener thread receives external message from socket 2. Listener thread files message for processing on receive channel 3. Listener thread returns to blocking on the receive socket
4. Processing thread picks up message from receive channel 5. Processing thread processes message 6. Processing thread puts reply on the send channel
7. Sending thread picks up message from send channel 8. Sending thread makes a blocking network send call to transmit the message 9. Sending thread returns to blocking on the send channel
When queue.Queue is used to pass the messages between threads, such an arrangement will be effectively non-blocking as long as the send rate is greater than or equal to the receive rate. However, the GIL means it won't exploit all available cores, even if we create multiple processing threads: you have to switch to multiprocessing for that, with all the extra overhead that entails.
So I see the essential premise of PEP 554 as being to ask the question "If each of these threads was running its own *interpreter*, could we use Sans IO style protocols with interpreter channels to separate internally "synchronous" processing threads from separate IO threads operating at system boundaries, without having to make the entire application pervasively asynchronous?"
+1
If channels are an unbuffered blocking primitive, then we don't get that benefit: even when there are additional receive messages to be processed, the processing thread will block until the previous send has completed. Switching the listener and sender threads over to asynchronous IO would help with that, but they'd also end up having to implement their own message buffering to manage the lack of buffering in the core channel primitive.
By contrast, if the core channels are designed to offer an unbounded buffer by default, then you can get close-to-CSP semantics just by setting the buffer size to 1 (it's still not exactly CSP, since that has a buffer size of 0, but you at least get the semantics of having to alternate sending and receiving of messages).
Yep, I came to the same conclusion.
By the way, I do think efficiency is a concern here. Otherwise subinterpreters don't even have a point (just use multiprocessing).
Agreed, and I think the interaction between the threading module and the interpreters module is one we're going to have to explicitly call out as being covered by the provisional status of the interpreters module, as I think it could be incredibly valuable to be able to send at least some threading objects through channels, and have them be an interpreter-specific reference to a common underlying sync primitive.
Agreed. I'll add a note to the PEP. -eric
On Mon, Sep 25, 2017 at 8:42 PM, Nathaniel Smith <njs@pobox.com> wrote:
It's fairly reasonable to implement a mutex using a CSP-style unbuffered channel (send = acquire, receive = release). And the same trick turns a channel with a fixed-size buffer into a bounded semaphore. It won't be as efficient as a modern specialized mutex implementation, of course, but it's workable.
Unfortunately while technically you can construct a buffered channel out of an unbuffered channel, the construction's pretty unreasonable (it needs two dedicated threads per channel).
Yeah, if threading's synchronization primitives make sense between interpreters then we'll add direct support. Using channels for that isn't a good option. -eric
After having looked it over, I'm leaning toward supporting buffering, as well as not blocking by default. Neither adds much complexity to the implementation. On Sat, Sep 23, 2017 at 5:45 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 22 Sep 2017 19:09:01 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
send() blocking until someone else calls recv() is not only bad for performance,
What is the performance problem?
Intuitively, there must be some kind of context switch (interpreter switch?) at each send() call to let the other end receive the data, since you don't have any internal buffering.
There would be an internal size-1 buffer.
(FWIW, CSP provides rigorous guarantees about deadlock detection (which Go leverages), though I'm not sure how much benefit that can offer such a dynamic language as Python.)
Hmm... deadlock detection is one thing, but when detected you must still solve those deadlock issues, right?
Yeah, I haven't given much thought into how we could leverage that capability but my gut feeling is that we won't have much opportunity to do so. :)
I'm not sure I understand your concern here. Perhaps I used the word "sharing" too ambiguously? By "sharing" I mean that the two actors have read access to something that at least one of them can modify. If they both only have read-only access then it's effectively the same as if they are not sharing.
Right. What I mean is that you *can* share very simple "data" under the form of synchronization primitives. You may want to synchronize your interpreters even if they don't share user-visible memory areas. The point of synchronization is not only to avoid memory corruption but also to regulate and orchestrate processing amongst multiple workers (for example processes or interpreters). For example, a semaphore is an easy way to implement "I want no more than N workers to do this thing at the same time" ("this thing" can be something such as disk I/O).
I'm still not convinced that sharing synchronization primitives is important enough to be worth including in the PEP. It can be added later, or via an extension module in the meantime. To that end, I'll add a mechanism to the PEP for third-party types to indicate that they can be passed through channels. Something like "obj.__channel_support__ = True". -eric
On Mon, 2 Oct 2017 22:15:01 -0400 Eric Snow <ericsnowcurrently@gmail.com> wrote:
I'm still not convinced that sharing synchronization primitives is important enough to be worth including in the PEP. It can be added later, or via an extension module in the meantime. To that end, I'll add a mechanism to the PEP for third-party types to indicate that they can be passed through channels. Something like "obj.__channel_support__ = True".
How would that work? If it's simply a matter of flipping a bit, why don't we do it for all objects? Regards Antoine.
On Tue, Oct 3, 2017 at 5:00 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
[...]
How would that work? If it's simply a matter of flipping a bit, why don't we do it for all objects?
The type would also have to be safe to share between interpreters. :) Eventually I'd like to make that work for all immutable objects (and immutable containers thereof), but until then each type must be adapted individually. The PEP starts off with just bytes. -eric
On Tue, 3 Oct 2017 08:36:55 -0600 Eric Snow <ericsnowcurrently@gmail.com> wrote:
[...]
The type would also have to be safe to share between interpreters. :)
But what does it mean to be safe to share, while the exact degree and nature of the isolation between interpreters (and also their concurrent execution) is unspecified? I think we need a sharing protocol, not just a flag. We also need to think carefully about that protocol, so that it does not imply unnecessary memory copies. Therefore I think the protocol should be something like the buffer protocol: one that allows acquiring and releasing a set of shared memory areas, without imposing any semantics onto those memory areas (each type implementing its own semantics). And there needs to be dedicated reference counting for object shares, so that the original object can be notified when all its shares have vanished. Regards Antoine.
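One hypothetical shape such a protocol could take, with invented method names (nothing here is in the PEP)::

    class ShareableBuffer:
        """Sketch of a buffer-protocol-like sharing protocol."""

        def __acquire_share__(self):
            # Hand out a memoryview-like area over the underlying
            # memory and bump a dedicated share count (no copy made).
            raise NotImplementedError

        def __release_share__(self):
            # Drop one share; when the count reaches zero the original
            # object is notified that all its shares have vanished.
            raise NotImplementedError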
On 03Oct2017 0755, Antoine Pitrou wrote:
[...]
But what does it mean to be safe to share, while the exact degree and nature of the isolation between interpreters (and also their concurrent execution) is unspecified?
I think we need a sharing protocol, not just a flag.
The easiest such protocol is essentially:

* an object can represent itself as bytes (e.g. generate a bytes object representing some global token, such as a kernel handle or memory address)
* those bytes are sent over the standard channel
* the object can instantiate itself from those bytes (e.g. wrap the existing handle, create a memoryview over the same block of memory, etc.)
* cross-interpreter refcounting is either ignored (because the kernel is refcounting the resource) or manual (by including more shared info in the token)

Since this is trivial to implement over the basic bytes channel, and doesn't even require a standard protocol except for convenience, Eric decided to avoid blocking the core functionality on this. I'm inclined to agree - get the basic functionality supported and let people build on it before we try to lock down something we don't fully understand yet.

About the only thing that seems to be worth doing up-front is some sort of pending-call callback mechanism between interpreters, but even that doesn't need to block the core functionality (you can do it trivially with threads and another channel right now, and there's always room to make something more efficient later).

There are plenty of smart people out there who can and will figure out the best way to design this. By giving them the tools and the ability to design something awesome, we're more likely to get something awesome than by committing to a complete design now. Right now, they're all blocked on the fact that subinterpreters are incredibly hard to start running, let alone experiment with. Eric's PEP will fix that part and enable others to take it from building blocks to powerful libraries.

Cheers, Steve
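A minimal sketch of that recipe, assuming the PEP's bytes channels; the class and its token format are invented for illustration, with an fd-backed mmap standing in for the shared resource::

    import mmap
    import os

    class SharedBlock:
        """A memory block that can be re-opened from a bytes token (sketch)."""

        def __init__(self, fd, size):
            self.fd = fd
            self.size = size
            self.view = mmap.mmap(fd, size)  # view over the shared memory

        @classmethod
        def create(cls, size):
            fd = os.open('/tmp/pep554-demo', os.O_CREAT | os.O_RDWR)
            os.ftruncate(fd, size)
            return cls(fd, size)

        def token(self):
            # Represent the object as bytes.  An fd number is only valid
            # within one process, which suffices for subinterpreters,
            # since they all share the same process.
            return b'%d:%d' % (self.fd, self.size)

        @classmethod
        def from_token(cls, token):
            fd, size = map(int, token.split(b':'))
            return cls(fd, size)

    # Sender:   send.send(block.token())                  # bytes over the channel
    # Receiver: block = SharedBlock.from_token(recv.recv())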
Hi Eric,
To make this work, the mutable shared state will be managed by the Python runtime, not by any of the interpreters. Initially we will support only one type of object for shared state: the channels provided by create_channel(). Channels, in turn, will carefully manage passing objects between interpreters. [0]
Would it make sense to make the default channel type explicit, something like ``create_channel(bytes)``? Thanks in advance, --francis

[0] https://www.python.org/dev/peps/pep-0554/
On 23 Sep 2017, at 3:09, Eric Snow wrote:
[...]
``list_all()``::
Return a list of all existing interpreters.
See my naming proposal in the previous thread.
Sorry, your previous comment slipped through the cracks. You suggested:
As for the naming, let's make it both unconfusing and explicit? How about three functions: `all_interpreters()`, `running_interpreters()` and `idle_interpreters()`, for example?
As to "all_interpreters()", I suppose it's the difference between "interpreters.all_interpreters()" and "interpreters.list_all()". To me the latter looks better.
But in most cases when Python returns a container (list/dict/iterator) of things, the name of the function/method is the name of the things, not the name of the container; e.g. we have sys.modules, dict.keys, dict.values, etc. Or, if the collection of things itself has a name, it is that name, e.g. os.environ, sys.path, etc. It's a little bit unfortunate that the name of the module would be the same as the name of the function, but IMHO interpreters() would be better than list().
As to "running_interpreters()" and "idle_interpreters()", I'm not sure what the benefit would be. You can compose either list manually with a simple comprehension:
    [interp for interp in interpreters.list_all() if interp.is_running()]
    [interp for interp in interpreters.list_all() if not interp.is_running()]
Servus, Walter
On 14 September 2017 at 11:44, Eric Snow <ericsnowcurrently@gmail.com> wrote:
Examples ========
Run isolated code -----------------
::
    interp = interpreters.create()
    print('before')
    interp.run('print("during")')
    print('after')
A few more suggestions for examples:

Running a module::

    main_module = mod_name
    interp.run(f"import runpy; runpy.run_module({main_module!r})")

Running a script (including zip archives & directories)::

    main_script = path_name
    interp.run(f"import runpy; runpy.run_path({main_script!r})")

Running in a thread pool executor::

    interps = [interpreters.create() for i in range(5)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(interps)) as pool:
        print('before')
        for interp in interps:
            pool.submit(interp.run, 'print("starting"); print("stopping")')
        print('after')

That last one is prompted by the questions about the benefits of keeping the notion of an interpreter state distinct from the notion of a main thread (it allows a single "MainThread" object to be mapped to different OS-level threads at different points in time, which means it's easier to combine with existing constructs for managing OS-level thread pools). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (10):
- Antoine Pitrou
- Eric Snow
- francismb
- Koos Zevenhoven
- MRAB
- Nathaniel Smith
- Nick Coghlan
- Steve Dower
- Walter Dörwald
- Yury Selivanov