[Python-ideas] PEP 3156 feedback

Guido van Rossum guido at python.org
Tue Dec 18 19:02:05 CET 2012


On Tue, Dec 18, 2012 at 2:01 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> Here is my own feedback on the in-progress PEP 3156. Please discard it
> if it's too early to give feedback :-))

Thank you, it's very to the point.

> Event loop API
> --------------
>
> I would like to say that I prefer Tornado's model: for each primitive
> provided by Tornado, you can pass an explicit Loop instance which you
> instantiated manually.
> There is no module function or policy object hiding this mechanism:
> it's simple, explicit and flexible (in other words: if you want a
> per-thread event loop, just do it yourself using TLS :-)).

It sounds though as if the explicit loop is optional, and still
defaults to some global default loop?

Having one global loop shared by multiple threads is iffy though. Only
one thread should be *running* the loop, otherwise the loop can't be
used as a mutual exclusion device. Worse, all primitives for adding
and removing callbacks/handlers must be made threadsafe, and then
basically the entire event loop becomes full of locks, which seems
wrong to me.
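
(For the record, the TLS pattern Antoine mentions is easy enough to
write yourself; rough sketch, with EventLoop standing in for whatever
the concrete loop class ends up being called:)

import threading

_tls = threading.local()

def get_thread_event_loop():
    # One loop per thread, created lazily; since each thread only ever
    # runs its own loop, nothing inside the loop needs locking.
    loop = getattr(_tls, 'loop', None)
    if loop is None:
        loop = _tls.loop = EventLoop()  # hypothetical constructor
    return loop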

PEP 3156 lets the loop implementation choose the policy, which seems
safer than letting the user choose a policy that may or may not be
compatible with the loop's implementation.

Steve Dower keeps telling me that on Windows 8 the loop is built into
the OS. The Windows 8 loop also seems to be eager to use threads, so I
don't know if it can be relied on to serialize callbacks, but there is
probably a way to do that, or else the Python wrapper could add a lock
around callbacks.

> There are some requirements I've found useful:
>
> - being able to instantiate multiple loops, either at the same time or
>   serially (this is especially nice for unit tests; Twisted has to use
>   a dedicated test runner just because their reactor doesn't support
>   multiple instances or restarts)

Serially, for unit tests: definitely. The loop policy has
init_event_loop() for this, which forcibly creates a new loop.

At the same time: that seems to be an esoteric use case and not
favorable to interop with Twisted.
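
For tests I imagine something like this (init_event_loop() and
get_event_loop() are the draft names; the spelling may well change,
and echo_client() is just a stand-in for whatever the test exercises):

import unittest

class ProtocolTest(unittest.TestCase):

    def setUp(self):
        # Forcibly create a fresh loop so no state leaks between tests.
        init_event_loop()            # draft policy hook, name not final
        self.loop = get_event_loop()

    def test_echo(self):
        # run_until_complete() drives the loop until the task finishes.
        self.loop.run_until_complete(echo_client(self.loop))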

I want the loop to be mostly out of the way of the user, at least for
users using the high-level APIs (tasks, futures, transports,
protocols). In fact, just for this reason it may be better if the
protocol-creating methods had wrapper functions that just called
get_event_loop() and then called the corresponding method on the loop,
so the user code doesn't have to call get_event_loop() at all, ever
(or at least, if you call it, you should feel a slight tinge of guilt
about using a low-level API :-).
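
Something like (sketch):

def create_transport(protocol_factory, host, port, **kwds):
    # Module-level convenience wrapper: the caller never sees the loop.
    loop = get_event_loop()
    return loop.create_transport(protocol_factory, host, port, **kwds)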

> - being able to stop a loop explicitly: having to unregister all
>   handlers or delayed calls is a PITA in non-trivial situations (for
>   example you might have multiple protocol instances, each with a bunch
>   of timers, some perhaps even in third-party libraries; keeping track
>   of all this is the event loop's job)

I've been convinced of that too. I'm just procrastinating on the
implementation at this point.

TBH the details of what you should put in your main program will
probably change a few times before we're done...

> * The optional sock_*() methods: how about having different ABCs, e.g.
>   the EventLoop ABC for basic behaviour, and the NetworkedEventLoop ABC
>   adding the socket helpers?

Hm. That smells of Twisted's tree of interfaces, which I'm honestly
trying to get away from (and Glyph didn't push back on that :-).

I'm actually leaning towards requiring these for all loop
implementations -- surely they can all be emulated using each other.

But I'm not totally wedded to that either. I need more experience
using the stuff first. And Steve Dower says he's not interested in any
of the async I/O stuff (I suppose he means sockets), just in futures
and coroutines. So maybe the socket operations do have to be optional.
In that case, I propose to add inquiry functions that can tell you
whether certain groups of APIs are supported. Though you can probably
get away with hasattr(loop, 'sock_recv') and so on.
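
E.g. (sketch):

def sock_recv_or_fail(loop, sock, nbytes):
    # Inquiry by duck typing: a loop that doesn't implement the socket
    # group of APIs simply won't have the method.
    if hasattr(loop, 'sock_recv'):
        return loop.sock_recv(sock, nbytes)  # a Future in the draft API
    raise NotImplementedError('this loop does not do raw socket I/O')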

> Protocols and transports
> ------------------------
>
> We probably want to provide a Protocol base class and encourage people
> to inherit it.

Glyph suggested that too, and hinted that it does some useful stuff
that users otherwise forget. I'm a bit worried though that the
functionality of the base implementation becomes the de-facto standard
rather than the PEP. (Glyph mentions that the base class has a method
that sets self.transport and without it lots of other stuff breaks.)

> It can provide useful functionality (perhaps write()
> and writelines() shims? it can make mocking easier).

Those are transport methods though.

> My own opinion about Twisted's API is that the Factory class is often
> useless, and adds a cognitive burden. If you need a place to track all
> protocols of a given kind (e.g. all connections), you can do it
> yourself. Also, the Factory implies that you don't control how exactly
> your protocol gets instantiated (unless you override some method on the
> Factory I'm missing the name of: it is cumbersome).

Yeah, Glyph complains that people laugh at Twisted for using factories. :-)

> So, when creating a client, I would pass it a protocol instance.

Heh. That's how I started, and Glyph told me to pass a protocol
factory. It can just be a Protocol subclass though, as long as the
constructor has the right signature. So maybe we can avoid calling it
protocol_factory and name it protocol_class instead.

I struggled with what to do if the socket cannot be connected and
hence the transport not created. If you've already created the
protocol you're in a bit of trouble at that point. I proposed to call
connection_lost() in that case (without ever having called
connection_made()) but Glyph suggested that would be asking for rare
bugs (the connection_lost() code might not expect a half-initialized
protocol instance). Glyph proposed instead that create_transport()
should return a Future and the error should be that Future's
exception, and I like that much better.
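
From a coroutine that would look roughly like this (assuming
create_transport() yields a (transport, protocol) pair on success):

try:
    transport, protocol = yield from ev.create_transport(MyProtocol,
                                                         host, port)
except OSError as exc:
    # The connect failed; the exception comes out of the Future, and
    # neither connection_made() nor connection_lost() was ever called.
    print('connection failed:', exc)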

> When creating a server, I would pass it a protocol class. Here the base
> Protocol class comes into play, its __init__() could take the transport
> as argument and set the "transport" attribute with it. Further args
> could be optionally passed to the constructor:
>
> class MyProtocol(Protocol):
>     def __init__(self, transport, my_personal_attribute):
>         Protocol.__init__(self, transport)
>         self.my_personal_attribute = my_personal_attribute
>     ...
>
> def listen(ioloop):
>     # Each new connection will instantiate a MyProtocol with "foobar"
>     # for my_personal_attribute.
>     ioloop.listen_tcp(("0.0.0.0", 8080), MyProtocol, "foobar")
>
> (The hypothetical listen_tcp() is just a name: perhaps it's actually
> start_serving(). It should accept any callable, not just a class:
> therefore, you can define complex behaviour if you like)

I agree that it should be a callable, not necessarily a class. I don't
think it should take the transport -- that's what connection_made() is
for. I don't think we should make the API have additional arguments
either; you can use a lambda or functools.partial to pass those in.
(There are too many other arguments to start_serving() to make it
convenient or clear to have a *args, I think, though maybe we could
rearrange the argument order.)
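
So Antoine's example would become something like this (assuming
MyProtocol's __init__ now takes just my_personal_attribute, with the
transport arriving via connection_made(); the argument order of
start_serving() is illustrative only):

import functools

def listen(ev):
    # Bind the extra constructor argument up front instead of widening
    # the start_serving() signature.
    factory = functools.partial(MyProtocol, 'foobar')
    ev.start_serving(factory, '0.0.0.0', 8080)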

> I think the transport / protocol registration must be done early, not in
> connection_made(). Sometimes you will want to do things on a protocol
> before you know a connection is established, for example queue things
> to write on the transport. A use case is a reconnecting TCP client:
> the protocol will continue existing at times when the connection is
> down.

Hm. That seems a pretty advanced use case. I think it is better
handled by passing a "factory function" that returns a pre-created
protocol:

pr = MyProtocol(...)
ev.create_transport(lambda: pr, host, port)

However you do this, such a protocol object must expect multiple
connection_made()/connection_lost() cycles, which sounds to me like
asking for trouble. So maybe it's better to have a thin protocol class
that is newly instantiated for each reconnection but given a pointer
to a more permanent data structure that carries state between
reconnections.
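
Sketch of that split (assuming the Protocol base class discussed
above, and continuing the ev/host/port naming from the earlier
example):

class Session:
    # Permanent state; survives across reconnections.
    def __init__(self):
        self.pending = []           # writes queued while disconnected

class ThinProtocol(Protocol):
    # Instantiated anew for every (re)connection attempt.
    def __init__(self, session):
        self.session = session

    def connection_made(self, transport):
        self.transport = transport
        # Flush anything queued while we were disconnected.
        for data in self.session.pending:
            transport.write(data)
        del self.session.pending[:]

session = Session()
ev.create_transport(lambda: ThinProtocol(session), host, port)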

> Unconnected protocols need their own base class and API:
> data_received()'s signature should be (data, remote_addr) or
> (remote_addr, data). Same for write().

You mean UDP? Let's put that off until later. But yes, it probably
needs more thought.

> * writelines() sounds ambiguous for datagram protocols: does it send
>   those "lines" as a single datagram, or one separate datagram per
>   "line"? The equivalent code suggests the latter, but which one makes
>   more sense?

It is the transport's choice. Twisted has writeSequence(), which is
just as ambiguous.

> * connection_lost(): you definitely want to know whether it's you or the
>   other end who closed the connection. Typically, if the other end
>   closed the connection, you will have to run some cleanup steps, and
>   perhaps even log an error somewhere (if the connection was closed
>   unexpectedly).

Glyph's idea was to always pass an exception and use special exception
subclasses to distinguish the three cases (clean EOF from the other
end, self.close(), self.abort()). I resisted this but maybe it's the
only way?
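
Roughly (the exception names here are invented for illustration, and
the Protocol base class is the one discussed above):

import logging

class MyProtocol(Protocol):

    def connection_lost(self, exc):
        if isinstance(exc, PeerClosed):       # hypothetical: clean EOF
            pass                              # orderly shutdown by peer
        elif isinstance(exc, LocallyClosed):  # hypothetical: close()/abort()
            pass                              # we initiated it ourselves
        else:
            logging.error('connection lost unexpectedly: %s', exc)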

>   Actually, I'm not sure it's useful to call connection_lost() when you
>   closed the connection yourself: are there any use cases?

Well, close() first has to finish writing buffered data, so any
cleanup needs to be done asynchronously after that is taken care of.
AFAIK Twisted always calls it, and I think that's the best approach to
ensure cleanup is always taken care of.

-- 
--Guido van Rossum (python.org/~guido)