On Thu, Sep 14, 2017 at 5:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 14 September 2017 at 15:27, Nathaniel Smith <njs@pobox.com> wrote:
I don't get it. With bytes, you can either share objects or copy them and the user can't tell the difference, so you can change your mind later if you want. But memoryviews require some kind of cross-interpreter strong reference to keep the underlying buffer object alive. So if you want to minimize object sharing, surely bytes are more future-proof.
Not really, because the only way to ensure object separation (i.e. no refcounted objects accessible from multiple interpreters at once) with a bytes-based API would be to either:

1. Always copy (eliminating most of the low overhead communications benefits that subinterpreters may offer over multiple processes)
2. Make the bytes implementation more complicated by allowing multiple bytes objects to share the same underlying storage while presenting as distinct objects in different interpreters
3. Make the output on the receiving side not actually a bytes object, but instead a view onto memory owned by another object in a different interpreter (a "memory view", one might say)
And yes, using memory views for this does mean defining either a subclass or a mediating object that not only keeps the originating object alive until the receiving memoryview is closed, but also retains a reference to the originating interpreter so that it can switch to it when it needs to manipulate the source object's refcount or call one of the buffer methods.
Yury and I are fine with that, since it means that either the sender *or* the receiver can decide to copy the data (e.g. by calling bytes(obj) before sending, or bytes(view) after receiving), and in the meantime, the object holding the cross-interpreter view knows that it needs to switch interpreters (and hence acquire the sending interpreter's GIL) before doing anything with the source object.
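The "either side can copy" point can be illustrated inside a single interpreter with the existing memoryview type. This is only a sketch: the bytearray stands in for the sender's object, and the mediating object that would switch interpreters is not modelled at all.

```python
# Stand-in for the sender's buffer. A real cross-interpreter view would
# need the mediating object described above; that part is not modelled.
source = bytearray(b"hello world")

# Sender-side copy: the receiver would get an independent bytes object.
sent_copy = bytes(source)

# Receiver-side view: no copy is made, and the view keeps `source` alive.
view = memoryview(source)

# Receiver-side copy: once taken, the view can be closed.
received_copy = bytes(view)
view.release()

# Both copies are unaffected by later changes to the source buffer.
source[0:5] = b"HELLO"
assert sent_copy == b"hello world"
assert received_copy == b"hello world"
```

In the proposed design, the view creation and the release() call would correspond to the two points where the sending interpreter's GIL has to be acquired.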
The reason we're OK with this is that it means that only reading a new message from a channel (i.e. creating a cross-interpreter view) or discarding a previously read message (i.e. closing a cross-interpreter view) will be synchronisation points where the receiving interpreter necessarily needs to acquire the sending interpreter's GIL.
By contrast, if we allow an actual bytes object to be shared, then either every INCREF or DECREF on that bytes object becomes a synchronisation point, or else we end up needing some kind of secondary per-interpreter refcount where the interpreter doesn't drop its shared reference to the original object in its source interpreter until the internal refcount in the borrowing interpreter drops to zero.
Ah, that makes more sense.

I am nervous that allowing arbitrary memoryviews gives a *little* more power than we need or want. I like that the current API can reasonably be emulated using subprocesses -- it opens up the door for backports, compatibility support on language implementations that don't support subinterpreters, direct benchmark comparisons between the two implementation strategies, etc.

But if we allow arbitrary memoryviews, then this requires that you can take (a) an arbitrary object, not specified ahead of time, and (b) provide two read-write views on it in separate interpreters such that modifications made in one are immediately visible in the other. Subprocesses can do one or the other -- they can copy arbitrary data, and if you warn them ahead of time when you allocate the buffer, they can do real zero-copy shared memory. But the combination is really difficult.

It'd be one thing if this were like a key feature that gave subinterpreters an advantage over subprocesses, but it seems really unlikely to me that a library won't know ahead of time when it's filling in a buffer to be transferred, and if anything it seems like we'd rather not expose read-write shared mappings in any case. It's extremely non-trivial to do right [1].

tl;dr: let's not rule out a useful implementation strategy based on a feature we don't actually need.

One alternative would be your option (3) -- you can put bytes in and get memoryviews out, and since bytes objects are immutable it's OK.

[1] https://en.wikipedia.org/wiki/Memory_model_(programming)
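The difference between the two cases shows up even within one interpreter, using only the existing memoryview type. The sketch below says nothing about how a cross-interpreter view would be implemented; it only contrasts a read-write view over a mutable buffer with a view over an immutable bytes object.

```python
# Read-write case: two views over one mutable buffer see each other's writes.
buf = bytearray(b"abcd")
view_a = memoryview(buf)
view_b = memoryview(buf)
view_a[0] = ord("X")
shared_write_visible = bytes(view_b) == b"Xbcd"

# "bytes in, memoryview out": a view over a bytes object is read-only,
# so the receiver cannot modify what the sender sees.
payload = b"abcd"
ro_view = memoryview(payload)
try:
    ro_view[0] = ord("X")
    write_rejected = False
except TypeError:
    write_rejected = True

assert shared_write_visible
assert ro_view.readonly
assert write_rejected
```

With only the second kind of view exposed, copying the data (as a subprocess-based emulation would) is indistinguishable from sharing it.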
Handling an exception
---------------------

It would also be reasonable to simply not return any value/exception from run() at all, or maybe just a bool for whether there was an unhandled exception. Any high level API is going to be injecting code on both sides of the interpreter boundary anyway, so it can do whatever exception and traceback translation it wants to.
So any more detailed response would *have* to come back as a channel message?
That sounds like a reasonable option to me, too, especially since module level code doesn't have a return value as such - you can really only say "it raised an exception (and this was the exception it raised)" or "it reached the end of the code without raising an exception".
Given that, I think subprocess.run() (with check=False) is the right API precedent here: https://docs.python.org/3/library/subprocess.html#subprocess.run
That always returns subprocess.CompletedProcess, and then you can call "cp.check_returncode()" to get it to raise subprocess.CalledProcessError for non-zero return codes.
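For reference, the subprocess precedent looks like this in practice. The child command is just an arbitrary example that exits with a non-zero status.

```python
import subprocess
import sys

# check=False is the default: a non-zero exit status does not raise here.
cp = subprocess.run([sys.executable, "-c", "raise SystemExit(3)"])
print(type(cp).__name__, cp.returncode)

# The caller opts in to an exception afterwards.
try:
    cp.check_returncode()
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    print("check_returncode() raised for exit status", exc.returncode)
```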
For interpreter.run(), we could keep the initial RunResult *really* simple and only report back:
* source: the source code passed to run()
* shared: the keyword args passed to run() (name chosen to match functools.partial)
* completed: completed execution without raising an exception? (True if yes, False otherwise)
Whether or not to report more details for a raised exception, and provide some mechanism to reraise it in the calling interpreter could then be deferred until later.
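A minimal sketch of what such a result object could look like. Everything here is hypothetical: the RunResult class is just the three fields listed above, and run() is a stand-in that uses exec() in the current interpreter rather than a real subinterpreter.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical result object carrying only the three proposed fields."""
    source: str
    shared: dict = field(default_factory=dict)
    completed: bool = False


def run(source, **shared):
    """Stand-in for Interpreter.run(); executes via exec() in this interpreter."""
    namespace = dict(shared)  # copy, so exec() does not mutate `shared`
    try:
        exec(source, namespace)
    except Exception:
        # No exception details are reported back, only the flag.
        return RunResult(source, shared, completed=False)
    return RunResult(source, shared, completed=True)


ok = run("total = x + 1", x=41)
failed = run("1 / 0")
print(ok.completed, failed.completed)
```

Any richer reporting, such as the exception itself, would be layered on later without changing these three fields.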
The subprocess.run() comparison does make me wonder whether this might be a more future-proof signature for Interpreter.run() though:
def run(source_str, /, *, channels=None): ...
That way channels can be a namespace *specifically* for passing in channels, and can be reported as such on RunResult. If we decide to allow arbitrary shared objects in the future, or add flag options like "reraise=True" to reraise exceptions from the subinterpreter in the current interpreter, we'd have that ability, rather than having the entire potential keyword namespace taken up for passing shared objects.
Would channels be a dict, or...?

-n

--
Nathaniel J. Smith -- https://vorpus.org