[Python-ideas] Async API: some code to review

Guido van Rossum guido at python.org
Wed Nov 7 16:19:32 CET 2012


Glyph and three other Twisted developers visited me yesterday. All is well.
We're behind in reporting -- I have a variety of trips and other activities
coming up, but I am still very much planning to act on what we discussed.
(And no, they didn't convince me to add Twisted to the stdlib. :-)

--Guido


On Wed, Nov 7, 2012 at 1:11 AM, Devin Jeanpierre <jeanpierreda at gmail.com> wrote:

> It's been a week, and nobody has responded to Glyph's email. I don't
> think I know enough to agree or disagree with what he said, but it was
> well-written and it looked important. Also, Glyph has a lot of
> experience with this sort of thing, and it would be a shame if he was
> discouraged by the lack of response. We can't really expect people to
> contribute if their opinions are ignored.
>
> Can relevant people please take another look at his post?
>
> -- Devin
>
> On Wed, Oct 31, 2012 at 6:10 AM, Glyph <glyph at twistedmatrix.com> wrote:
> > Finally getting around to this one...
> >
> > I am sorry if I'm repeating any criticism that has already been rehashed
> > in this thread.  There is really a deluge of mail here and I can't keep
> > up with it.  I've skimmed some of it and avoided or noted things that I
> > did see mentioned, but I figured I should write up something before next
> > week.
> >
> > To make a long story short, my main points here are:
> >
> > - I think tulip unfortunately has a lot of the problems I tried to
> >   describe in earlier messages,
> > - it would be really great if we could have a core I/O interface that we
> >   could use for interoperability with Twisted before bolting a
> >   requirement for coroutine trampolines on to everything,
> > - twisted-style protocol/transport separation is really important and
> >   this should not neglect it.  As I've tried to illustrate in previous
> >   messages, an API where applications have to call send() or recv() is
> >   just not going to behave intuitively in edge cases or perform well,
> > - I know it's a prototype, but this isn't such an unexplored area that it
> >   should be developed without TDD: all this code should both have tests
> >   and provide testing support to show how applications that use it can be
> >   tested,
> > - the scheduler module needs some example implementation of something
> >   like Twisted's gatherResults for me to critique its expressiveness; it
> >   looks like it might be missing something in the area of one task
> >   coordinating multiple others but I can't tell.
> >
> >
> > On Oct 28, 2012, at 4:52 PM, Guido van Rossum <guido at python.org> wrote:
> >
> > > The pollster has a very simple API: add_reader(fd, callback, *args),
> > > add_writer(<ditto>), remove_reader(fd), remove_writer(fd), and
> > > poll(timeout) -> list of events. (fd means file descriptor.) There's
> > > also pollable() which just checks if there are any fds registered. My
> > > implementation requires fd to be an int, but that could easily be
> > > extended to support other types of event sources.
> >
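To make the pollster interface described above concrete, here is a minimal sketch built on select.select(). The class name SelectPollster and the shape of a returned "event" (a (callback, args) pair) are assumptions made for illustration; this is not the code in tulip's polling.py.

```python
import select
import socket


class SelectPollster:
    """Hypothetical pollster with the add/remove/poll interface described above."""

    def __init__(self):
        self.readers = {}  # fd -> (callback, args)
        self.writers = {}  # fd -> (callback, args)

    def add_reader(self, fd, callback, *args):
        self.readers[fd] = (callback, args)

    def add_writer(self, fd, callback, *args):
        self.writers[fd] = (callback, args)

    def remove_reader(self, fd):
        del self.readers[fd]

    def remove_writer(self, fd):
        del self.writers[fd]

    def pollable(self):
        return bool(self.readers or self.writers)

    def poll(self, timeout=None):
        # Assumption: an "event" is the (callback, args) pair of a ready fd.
        if not self.pollable():
            return []
        ready_r, ready_w, _ = select.select(
            list(self.readers), list(self.writers), [], timeout)
        return ([self.readers[fd] for fd in ready_r] +
                [self.writers[fd] for fd in ready_w])


# Usage: a socketpair stands in for a real connection.
a, b = socket.socketpair()
received = []
pollster = SelectPollster()
pollster.add_reader(b.fileno(), lambda sock: received.append(sock.recv(100)), b)
a.sendall(b"ping")
for callback, args in pollster.poll(timeout=1.0):
    callback(*args)
pollster.remove_reader(b.fileno())
a.close()
b.close()
```

Because the tables are keyed by integer file descriptors and handed straight to select(), accepting a different kind of event source would require more than a type change in this sketch.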
> >
> > I don't see how that is.  All of the mechanisms I would leverage within
> > Twisted to support other event sources are missing (e.g.: abstract
> > interfaces for those event sources).  Are you saying that a totally
> > different pollster could just accept a different type to add_reader, and
> > not an integer?  If so, how would application code know how to construct
> > something else?
> >
> > > I'm not super happy that I have parallel reader/writer APIs, but
> > > passing a separate read/write flag didn't come out any more elegant,
> > > and I don't foresee other operation types (though I may be wrong).
> >
> >
> > add_reader and add_writer are an important internal layer of the API for
> > UNIX-like operating systems, but the design here is fundamentally flawed
> > in that application code (e.g. echosvr.py) needs to import concrete
> > socket-handling classes like SocketTransport and BufferedReader in order
> > to synthesize a transport.  These classes might need to vary their
> > behavior significantly between platforms, and application code should
> > not be manipulating them unless there is a serious low-level need to.
> >
> > It looks like you've already addressed the fact that some transports
> > need to be platform-specific.  That's not quite accurate, unless you
> > take a very broad definition of "platform".  In Twisted, the basic
> > socket-based TCP transport is actually supported across all platforms;
> > but some other *APIs* need their own transports (well, let's be honest,
> > right now, just IOCP, but there have been others, such as Java's native
> > I/O APIs under Jython, in the past).
> >
> > You have to ask the "pollster" (by which I mean: reactor) for transport
> > objects, because different multiplexing mechanisms can require different
> > I/O APIs, even for basic socket I/O.  This is why I keep talking about
> > IOCP.  It's not that Windows is particularly great, but that the IOCP
> > API, if used correctly, is fairly alien, and is a good proxy for other
> > use-cases which are less direct to explain, like interacting with GUI
> > libraries where you need to interact with the GUI's notion of a socket
> > to get notifications, rather than a raw FD.  (GUI libraries often do
> > this because they have to support Windows and therefore IOCP.)  Others
> > in this thread have already mentioned the fact that ZeroMQ requires the
> > same sort of affordance.  This is really a design error on 0MQ's part,
> > but, you have to deal with it anyway ;-).
> >
> > More importantly, concretely tying everything to sockets is just bad
> > design.  You want to be able to operate on pipes and PTYs (which need to
> > call read(), or, a bunch of gross ioctl()s and then read(), not recv()).
> > You want to be able to operate on these things in unit tests without
> > involving any actual file descriptors or syscalls.  The higher level of
> > abstraction makes regular application code a lot shorter, too: I was
> > able to compress echosvr.py down to 22 lines by removing all the
> > comments and logging and such, but that is still more than twice as long
> > as the (9 line) echo server example on the front page of
> > <http://twistedmatrix.com/trac/>.  It's closer in length to the (19
> > line) full line-based publish/subscribe protocol over on the third tab.
> >
> > Also, what about testing? You want to be able to simulate the order of
> > responses of multiple syscalls to coerce your event-driven program to
> > receive its events in different orders.  One of the big advantages of
> > event driven programming is that everything's just a method call, so
> > your unit tests can just call the methods to deliver data to your
> > program and see what it does, without needing to have a large, elaborate
> > simulation edifice to pretend to be a socket.  But, once you mix in the
> > magic of the generator trampoline, it's somewhat hard to assemble your
> > own working environment without some kind of test event source; at
> > least, it's not clear to me how to assemble a Task without having a
> > pollster anywhere, or how to make my own basic pollster for testing.
> >
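The "everything's just a method call" style of testing can be sketched as follows. FakeTransport, EchoProtocol, and the method names connection_made/data_received are hypothetical names chosen for illustration, not Twisted's or tulip's actual classes.

```python
class FakeTransport:
    """Records writes in memory; no file descriptors or syscalls involved."""

    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)


class EchoProtocol:
    """Callback-style protocol: whoever owns the event source calls these."""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)


# The "test" plays the role of the event loop and delivers events directly,
# in whatever order and chunking it wants to exercise.
transport = FakeTransport()
protocol = EchoProtocol()
protocol.connection_made(transport)
protocol.data_received(b"hel")
protocol.data_received(b"lo")
```

Nothing here needs a pollster, a scheduler, or a socket, which is the property the paragraph above is asking the new API to preserve.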
> > > The event loop has two basic ways to register callbacks:
> > > call_soon(callback, *args) causes callback(*args) to be called the
> > > next time the event loop runs; call_later(delay, callback, *args)
> > > schedules a callback at some time (relative or absolute) in the
> > > future.
> >
> >
> > "relative or absolute" is hiding the whole monotonic-clocks discussion
> > behind a simple phrase, but that probably does not need to be resolved
> > here... I'll let you know if we ever figure it out :).
> >
> > > sockets.py: http://code.google.com/p/tulip/source/browse/sockets.py
> > >
> > > This implements some internet primitives using the APIs in
> > > scheduling.py (including block_r() and block_w()). I call them
> > > transports but they are different from Twisted's transports; they are
> > > closer to idealized sockets. SocketTransport wraps a plain socket,
> > > offering recv() and send() methods that must be invoked using yield
> > > from.
> >
> >
> > I feel I should note that these methods behave inconsistently; send()
> > behaves as sendall(), re-trying its writes until the full buffer has
> > been written, but recv() may yield a short read.
> >
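The asymmetry is easiest to see with plain blocking sockets, which is what this sketch uses instead of tulip's SocketTransport. sendall() loops until everything is written, so the matching read-side operation has to loop as well; recv_exactly is a hypothetical helper name.

```python
import socket


def recv_exactly(sock, n):
    """Loop over recv() until n bytes arrive, as sendall() does for writes."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)  # may return fewer bytes than requested
        if not chunk:
            raise EOFError("closed with %d bytes outstanding" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


a, b = socket.socketpair()
a.sendall(b"hello ")
a.sendall(b"world")
message = recv_exactly(b, 11)
a.close()
b.close()
```

A single b.recv(11) in place of recv_exactly() is allowed to return only part of the data, which is the short read mentioned above.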
> > (But most importantly, block_r and block_w are insufficient as
> > primitives; you need a separate pollster that uses
> > write_then_block(data) and read_then_block() too, which may need to
> > dispatch to WSASend/WSARecv or WriteFile/ReadFile.)
> >
> > > SslTransport wraps an ssl socket (luckily in Python 2.6 and up,
> > > stdlib ssl sockets have good async support!).
> >
> >
> > stdlib ssl sockets have async support that makes a number of UNIX-y
> > assumptions.  The wrap_socket trick doesn't work with IOCP, because the
> > I/O operations are initiated within the SSL layer, and therefore can't
> > be associated with a completion port, so they won't cause a queued
> > completion status trigger and therefore won't wake up the loop.  This
> > plagued us for many years within Twisted and has only relatively
> > recently been fixed: <http://tm.tl/593>.
> >
> > Since probably 99% of the people on this list don't actually give a crap
> > about Windows, let me give a more practical example: you can't do SSL
> > over a UNIX pipe.  Off the top of my head, this means you can't write a
> > command-line tool to encrypt a connection via a shell pipeline, but
> > there are many other cases where you'd expect to be able to get
> > arbitrary I/O over stdout.
> >
> > It's reasonable, of course, for lots of Python applications to not care
> > about high-performance, high-concurrency SSL on Windows; select() works
> > okay for many applications on Windows.  And most SSL happens on sockets,
> > not pipes, hence the existence of the OpenSSL API that the stdlib ssl
> > module exposes for wrapping sockets.  But, as I'll explain in a moment,
> > this is one reason that it's important to be able to give your code a
> > turbo boost with Twisted (or other third-party extensions) once you
> > start encountering problems like this.
> >
> > > I don't particularly care about the exact abstractions in this module;
> > > they are convenient and I was surprised how easy it was to add SSL,
> > > but still these mostly serve as somewhat realistic examples of how to
> > > use scheduling.py.
> >
> >
> > This is where I think we really differ.
> >
> > I think that the whole attempt to build a coroutine scheduler at the low
> > level is somewhat misguided and will encourage people to write
> > misleading, sloppy, incorrect programs that will be tricky to debug
> > (although, to be fair, not quite as tricky as even more
> > misleading/sloppy/incorrect multi-threaded ones).  However, I'm more
> > than happy to agree to disagree on this point: clearly you think that
> > forests of yielding coroutines are a big part of the future of Python.
> > Maybe you're even right to do so, since I have no interest in adding
> > language features, whereas if you hit a rough edge in 'yield' syntax you
> > can sand it off rather than living with it.  I will readily concede that
> > 'yield from' and 'return' are nicer than the somewhat ad-hoc idioms we
> > ended up having to contend with in the current iteration of
> > @inlineCallbacks.  (Except for the exit-at-a-distance problem, which it
> > doesn't seem that return->StopIteration addresses - does this happen,
> > with PEP-380 generators? <http://twistedmatrix.com/trac/ticket/4157>)
> >
> > What I'm not happy to disagree about is the importance of a good I/O
> > abstraction and interoperation layer.
> >
> > Twisted is not going away; there are oodles of good reasons that it's
> > built the way it is, as I've tried to describe in this and other
> > messages, and none of our plans for its future involve putting coroutine
> > trampolines at the core of the event loop; those are just fine over on
> > the side with inlineCallbacks.  However, lots of Python programmers are
> > going to use what you come up with.  They'd use it even if it didn't
> > really work, just because it's bundled in and it's convenient.  But I
> > think it'll probably work fine for many tasks, and it will appeal to
> > lots of people new to event-driven I/O because of the seductive
> > deception of synchronous control flow and the superiority to scheduling
> > I/O operations with threads.
> >
> > What I think is really very important in the design of this new system
> > is to present an API whereby:
> >
> > - if someone wants to write a basic protocol or data-format parser for
> >   the stdlib, it should be easy to write it as a feed parser without
> >   needing generator coroutines (for example, if they're pushing data
> >   into a C library, they shouldn't have to write a while loop that calls
> >   recv, they should be able to just transform some data callback in
> >   Python into some data callback in C); it should be able to leverage
> >   tulip without much more work,
> > - if users of tulip (read: the stdlib) need access to some functionality
> >   implemented within Twisted, like an event-driven DNS client that is
> >   more scalable than getaddrinfo, they can call into Twisted without
> >   re-writing their entire program,
> > - if users of Twisted need to invoke some functionality implemented on
> >   top of tulip, they can construct a task and weave in a scheduler,
> >   similarly without re-writing much,
> > - if users of tulip want to just use Twisted to get better performance
> >   or reliability than the built-in stdlib multiplexor, they ideally
> >   shouldn't have to change anything, just run it with a different import
> >   line or something, and
> > - if (when) users of tulip realize that their generators have devolved
> >   into a mess of spaghetti ;-) and they need to migrate to Twisted-style
> >   event-driven callbacks and maybe some formal state machines or
> >   generated parsers to deal with their inputs, that process can be done
> >   incrementally and not in one giant shoot-the-moon effort which will
> >   make them hate Twisted.
> >
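A feed parser in the sense of the first item can be sketched without any generator coroutine or recv() loop. LineFeedParser and its feed() method are hypothetical names; the only assumption is that some event source pushes byte chunks in as they arrive.

```python
class LineFeedParser:
    """Push-style parser: bytes are fed in, complete lines go to a callback."""

    def __init__(self, line_received):
        self.line_received = line_received
        self.buffer = b""

    def feed(self, data):
        self.buffer += data
        # Everything before the last newline is complete; keep the remainder.
        *lines, self.buffer = self.buffer.split(b"\n")
        for line in lines:
            self.line_received(line)


lines = []
parser = LineFeedParser(lines.append)
# Chunk boundaries are arbitrary, as they would be on a real connection.
parser.feed(b"GET /in")
parser.feed(b"dex\nHost: exa")
parser.feed(b"mple\n")
```

The same feed() method could be wired to a callback-style protocol, a coroutine that loops over reads, or a unit test, which is what makes this shape convenient for interoperability.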
> >
> > As an added bonus, such an API would provide a great basis for Tornado
> > and Twisted to interoperate.
> >
> > It would also be nice to have a more discrete I/O layer to insulate
> > application code from common foibles like the fact that, for example, if
> > you call send() in tulip multiple times but forget to 'yield from
> > ...send()', you may end up writing interleaved garbage on the
> > connection, then raising an assertion error, but only if there's a
> > sufficient quantity of data and it needs to block; it will otherwise
> > appear to work, leading to bugs that only start happening when you are
> > pushing large volumes of data through a system at rates exceeding wire
> > speed.  In other words, "only in production, only during the holiday
> > season, only during traffic spikes, only when it's really really
> > important for the system to keep working".
> >
> > This is why I think that step 1 here needs to be a common low-level API
> > for event-triggered operations that does not have anything to do with
> > generators.  I don't want to stop you from doing interesting things with
> > generators, but I do really want to decouple the tasks so that their
> > responsibilities are not unnecessarily conflated.
> >
> > task.unblock() is a method; protocol.data_received is a method.  Both
> > can be invoked at the same level by an event loop.  Once that low-level
> > event loop is delivering data to that callback's satisfaction, the
> > callbacks can happily drive a coroutine scheduler, and the coroutine
> > scheduler can have much less of a deep integration with the I/O itself;
> > it just needs some kind of sentinel object (a Future, a Deferred) to
> > keep track of what exactly it's waiting for.
> >
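One way to picture callbacks driving a coroutine scheduler through a sentinel object is the sketch below. Future, run_task, and ReaderProtocol are hypothetical and deliberately tiny; they are not tulip's Task, Twisted's Deferred, or any stdlib class, and error handling is omitted.

```python
class Future:
    """Hypothetical sentinel: holds a result and the callbacks waiting on it."""

    def __init__(self):
        self.done = False
        self.result = None
        self.callbacks = []

    def add_done_callback(self, callback):
        if self.done:
            callback(self)
        else:
            self.callbacks.append(callback)

    def set_result(self, result):
        self.done = True
        self.result = result
        for callback in self.callbacks:
            callback(self)


def run_task(gen):
    """Drive a generator that yields Futures; resume it as each one fires."""
    finished = Future()

    def step(value=None):
        try:
            future = gen.send(value)
        except StopIteration as stop:
            finished.set_result(stop.value)
            return
        future.add_done_callback(lambda f: step(f.result))

    step()
    return finished


class ReaderProtocol:
    """Callback-level object; an event loop would call data_received()."""

    def __init__(self):
        self.waiting = None

    def read(self):
        self.waiting = Future()
        return self.waiting

    def data_received(self, data):
        # Data arriving with no reader waiting is dropped in this sketch.
        waiter, self.waiting = self.waiting, None
        if waiter is not None:
            waiter.set_result(data)


def shout(reader):
    first = yield reader.read()
    second = yield reader.read()
    return (first + second).upper()


reader = ReaderProtocol()
task = run_task(shout(reader))
reader.data_received(b"hello ")  # what the event loop would do on readable
reader.data_received(b"world")
```

The coroutine layer here never touches a file descriptor; it only waits on Futures that the callback layer completes.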
> > > I'm most interested in feedback on the design of polling.py and
> > > scheduling.py, and to a lesser extent on the design of sockets.py;
> > > main.py is just an example of how this style works out in practice.
> >
> >
> > It looks to me like there's a design error in scheduling.py with respect
> > to coordinating concurrent operations.  If you try to block on two
> > operations at once, you'll get an assertion error
> > ('assert not self.blocked', in block), so you can't coordinate two
> > interesting I/O requests without spawning a bunch of new Tasks and then
> > having them unblock their parent Task when they're done.  I may just be
> > failing to imagine how one would implement something like Twisted's
> > gatherResults, but this looks like it would be frustrating, tedious, and
> > involve creating lots of extra objects and making the scheduler do a
> > bunch more work.
> >
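For comparison, a gatherResults-style combinator is short when the waiting primitive is a callback-carrying sentinel rather than a blocked Task. The Future class and gather_results function below are hypothetical stand-ins written for this sketch, with failure handling left out; they are not Twisted's implementation.

```python
class Future:
    """Hypothetical sentinel: holds a result and the callbacks waiting on it."""

    def __init__(self):
        self.done = False
        self.result = None
        self.callbacks = []

    def add_done_callback(self, callback):
        if self.done:
            callback(self)
        else:
            self.callbacks.append(callback)

    def set_result(self, result):
        self.done = True
        self.result = result
        for callback in self.callbacks:
            callback(self)


def gather_results(futures):
    """Return a Future that fires with all results, in input order."""
    combined = Future()
    results = [None] * len(futures)
    pending = len(futures)
    if pending == 0:
        combined.set_result(results)
        return combined

    def make_collector(index):
        def collect(future):
            nonlocal pending
            results[index] = future.result
            pending -= 1
            if pending == 0:
                combined.set_result(results)
        return collect

    for index, future in enumerate(futures):
        future.add_done_callback(make_collector(index))
    return combined


a, b, c = Future(), Future(), Future()
combined = gather_results([a, b, c])
c.set_result("third")   # completion order differs from input order
a.set_result("first")
done_early = combined.done
b.set_result("second")
```

No extra tasks are spawned here: one object coordinates three outstanding operations by counting callbacks.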
> > Also, shouldn't there be a lot more real exceptions and a lot fewer
> > assertions in this code?
> >
> > Relatedly, add_reader/writer will silently stomp on a previous FD
> > registration, so if two tasks end up calling recv() on the same socket,
> > it doesn't look like there's any way to find out that they both did
> > that.  It looks like the first task to call it will just hang forever,
> > and the second one will "win"?  What are the intended semantics?
> >
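One possible answer to the question of semantics, sketched under the assumption that a double registration is always a bug: raise a real exception instead of overwriting. StrictRegistry is a hypothetical class, not part of tulip.

```python
class StrictRegistry:
    """Hypothetical reader table that refuses to overwrite an existing entry."""

    def __init__(self):
        self.readers = {}  # fd -> (callback, args)

    def add_reader(self, fd, callback, *args):
        if fd in self.readers:
            raise RuntimeError("fd %r already has a reader registered" % (fd,))
        self.readers[fd] = (callback, args)

    def remove_reader(self, fd):
        del self.readers[fd]


registry = StrictRegistry()
registry.add_reader(7, print, "first task")
try:
    registry.add_reader(7, print, "second task")
except RuntimeError as exc:
    error = str(exc)
else:
    error = None
```

With this behavior the second caller learns about the conflict immediately, and the first registration stays intact instead of hanging silently.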
> > Speaking from the perspective of I/O scheduling, it will also be
> > thrashing any stateful multiplexor with a ton of unnecessary syscalls.
> > A Twisted protocol in normal operation just receiving data from a single
> > connection, using, let's say, a kqueue-based multiplexor will call
> > kevent() once to register interest, then kevent() again to block, and
> > then just keep getting data-available notifications and processing them
> > unless some downstream buffer fills up and the transport is told to
> > pause producing data, at which point another kevent() gets issued.
> > tulip, by contrast, will call kevent() over and over again, removing and
> > then re-adding its reader repeatedly for every packet, since it can
> > never know if someone is about to call recv() again any time soon.  Once
> > again, request/response is not the best model for retrieving data from a
> > transport; active connections need to be prepared to receive more data
> > at any time and not in response to any particular request.
> >
> > Finally, apologies for spelling / grammar errors; I didn't have a lot of
> > time to copy-edit.
> >
> > -glyph



-- 
--Guido van Rossum (python.org/~guido)