[Python-ideas] PEP 3156 - Asynchronous IO Support Rebooted

Tue Jan 8 21:11:25 CET 2013

(Trimming stuff that doesn't need a reply -- this doesn't mean I
agree, just that I don't see a need for more discussion.)

On Sun, Jan 6, 2013 at 7:45 AM, Markus <nepenthesdev at gmail.com> wrote:
> Exactly - signals are a mess, threading and signals make things worse
> - I'm no expert here, but I have just experienced problems with
> signal handling and threads, basically the same problems you describe.
> Creating the threads after installing signal handlers (in the main
> thread) works, and signals get delivered to the main thread,
> installing the signal handlers (in the main thread) after creating the
> threads - and the signals ended up in *some thread*.
> Additionally it depended on if you'd install your signal handler with
> signal() or sigaction() and flags when creating threads.

So I suppose you're okay with the signal handling API I proposed? I'll
add it to the PEP then, with a note that it may raise an exception if
not supported.

> I'd expect the EventLoop never to create threads on its own behalf,
> it's just wrong.

Here's the way it works. You can call run_in_executor(executor,
function, *args) where executor is an executor (a fancy thread pool)
that you create. You have full control. However you can pass
executor=None and then the event loop will create its own, default
executor -- or it will use a default executor that you have created
and given to it previously.

It needs the default executor so that it can implement getaddrinfo()
by calling the stdlib socket.getaddrinfo() in a thread -- and
getaddrinfo() is essential for creating transports. The user can take
full control over the executor though -- you could set the default to
something that always raises an exception.
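
A minimal sketch of the executor arrangement described above, written
against today's asyncio module (which grew out of this PEP; the names in
the 2013 draft may have differed). The helper resolve() and the pool
size are illustrative assumptions, not part of the proposal.

```python
import asyncio
import socket
from concurrent.futures import ThreadPoolExecutor


async def resolve(host, port, executor=None):
    """Run the blocking socket.getaddrinfo() in an executor thread."""
    loop = asyncio.get_running_loop()
    # executor=None means "use the loop's default executor".
    return await loop.run_in_executor(
        executor, socket.getaddrinfo, host, port, 0, socket.SOCK_STREAM
    )


async def main():
    loop = asyncio.get_running_loop()
    # Optional: hand the loop a default executor that you created yourself.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=2))
    # A numeric address keeps this demo free of real DNS traffic.
    return await resolve("127.0.0.1", 80)
```

Calling asyncio.run(main()) returns the usual getaddrinfo() result list;
passing your own executor to resolve() bypasses the default one entirely.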

> If you can't provide some functionality without threads, don't provide
> the functionality.

I don't see this as an absolute requirement. The threads are an
implementation detail (other event loop implementations could
implement getaddrinfo() differently, talking directly to DNS using
tasklets or callbacks), and you can control its use of threads.

> Besides, getaddrinfo() is a bad choice, as it relies on distribution
> specific flags.
> For example ip6 link local scope exists on every current platform, but
> - when resolving a link-local scope address -not domain- with
> getaddrinfo, getaddrinfo will fail if no global routed ipv6 address is
> available on debian/ubuntu.

Nevertheless it is the only thing available in the stdlib. If you want
to improve it, that's fine, but just use the issue tracker.

>>> As Transport are part of the PEP - some more:
>>>
>>> EventLoop
>>>  * create_transport(protocol_factory, host, port, **kwargs)
>>>    kwargs requires "local" - local address as tuple like
>>> ('fe80::14ad:1680:54e1:6a91%eth0',0) - so you can bind when using ipv6
>>> link local scope.
>>>   or ('192.168.2.1',5060) - bind local port for udp
>>
>> Not sure I understand. What socket.connect() (or other API) call
>> parameters does this correspond to? What can't be expressed through the
>> host and port parameters?
>
> In case you have multiple interfaces, and multiple gateways, you need
> to assign the connection to an address - so the kernel knows which
> interface to use for the connection - else it'd default to "the first"
> interface.
> In IPv6 link-local scope you can have multiple addresses in the same
> subnet fe80:: - IIRC if you want to connect somewhere, you have to
> either set the scope_id of the remote, or bind the "source" address
> before - I don't know how to set the scope_id in python, it's in
> sockaddr_in6.
>
> In terms of socket. it is a bind before a connect.
>
> s = socket.socket(AF_INET6,SOCK_DGRAM,0)
> s.bind(('fe80::1',0))
> s.connect(('fe80::2',4712))
>
> same for ipv4 in case you are multi homed and rely on source based routing.

Ok, this seems a useful option to add to create_transport(). Your
example shows SOCK_DGRAM -- is it also relevant for SOCK_STREAM?
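
For reference, a sketch of the bind-before-connect pattern as a plain
socket helper. The socket API accepts the same two calls for stream
sockets; whether an application needs them depends on its routing setup.
The name connect_from() and the AF_INET default are assumptions made for
illustration. (Current asyncio exposes this idea as the local_addr
argument of loop.create_connection().)

```python
import socket


def connect_from(local_addr, remote_addr, sock_type=socket.SOCK_STREAM):
    """Bind the source address first, then connect (AF_INET assumed)."""
    s = socket.socket(socket.AF_INET, sock_type)
    try:
        s.bind(local_addr)      # choose the source address; port 0 = any free port
        s.connect(remote_addr)  # the connection now originates from local_addr
    except OSError:
        s.close()
        raise
    return s
```

For example, connect_from(('192.168.2.1', 0), (host, port)) would pin the
source address on a multi-homed machine.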

>>> Handler:
>>> Requiring 2 handlers for every active connection r/w is highly ineffective.
>>
>> How so? What is the concern?
>
> Of course you can fold the fdsets, but in case you need a separate
> handler for write, you re-create it for every write - see below.

That would seem to depend on the write rate.

>>> Additionally, I can .stop() the handler without having to know the fd,
>>> .stop() the handler, change the events the handler is looking for,
>>> restart the handler with .start().
>>> In your proposal, I'd create a new handler every time I want to send
>>> something, poll for readability - discard the handler when I'm done,
>>> create a new one for the next send.
>>
>> The questions are, does it make any difference in efficiency (when
>> using Python -- the performance of the C API is hardly relevant here),
>> and how often does this pattern occur.
>
> Every time you send - you poll for write-ability, you get the
> callback, you write, you got nothing left, you stop polling for
> write-ability.

That's not quite how it's implemented. The code first tries to send
without polling. Since the socket is non-blocking, if this succeeds,
great -- only if it returns a partial send or EAGAIN do we register a
callback. If the protocol keeps the buffer filled, the callback doesn't
have to be recreated each time. If the protocol doesn't keep the
buffer full, we must unregister the callback to prevent
select/poll/etc. from calling it over and over again; there's nothing
you can do about that.
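
The send-first strategy can be sketched roughly as follows, using the
stdlib selectors module. This is an illustration under assumptions, not
the reference implementation: the class OptimisticWriter and its
attributes are hypothetical names.

```python
import selectors
import socket


class OptimisticWriter:
    """Hypothetical helper: send first, poll for writability only if needed."""

    def __init__(self, sock, selector):
        sock.setblocking(False)
        self.sock = sock
        self.selector = selector
        self.buffer = b""
        self.registered = False

    def write(self, data):
        if self.buffer:
            # A callback is already pending; just keep the buffer filled.
            self.buffer += data
            return
        try:
            sent = self.sock.send(data)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK
            sent = 0
        if sent < len(data):
            # Partial send: register the callback for the remainder.
            self.buffer = data[sent:]
            self.selector.register(
                self.sock, selectors.EVENT_WRITE, self._on_writable
            )
            self.registered = True

    def _on_writable(self):
        try:
            sent = self.sock.send(self.buffer)
        except BlockingIOError:
            return
        self.buffer = self.buffer[sent:]
        if not self.buffer:
            # Nothing left: unregister so the selector stops reporting us.
            self.selector.unregister(self.sock)
            self.registered = False
```

A driving loop would call key.data() for each key returned by
selector.select(). Small writes never touch the selector at all.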

>>>  * reconnect() - I'd love to be able to reconnect a transport
>>
>> But what does that mean in general? It depends on the protocol (e.g.
>> FTP, HTTP, IRC, SMTP) how much state must be restored/renegotiated
>> upon a reconnect, and how much data may have to be re-sent. This seems
>> a higher-level feature that transports and protocols will have to
>> implement themselves.
>
> I don't need the EventLoop to sync my state upon reconnect - just have
> the Transport providing the ability.
> Protocols are free to use this, but do not have to.

Aha, I get it. You want to be able to call transport.reconnect() from
connection_lost() and it should respond by eventually calling
protocol.connection_made(transport) again. Of course, this only
applies to clients -- for a server to reconnect to a client makes no
sense (it would be up to the client).

That seems simple enough to implement, but Glyph recommended strongly
against this, because reusing the protocol object often means that
some private state of the protocol may not be properly reinitialized.

It would also be difficult to decide where errors from the reconnect
attempt should go -- reconnect() itself must return immediately (since
connection_lost() cannot wait for I/O, it can only schedule async I/O
events).

But at a higher level in your app it would be easy to set this up: you
just call eventloop.create_transport(lambda: protocol, ...) where
protocol is a protocol instance you've created earlier.
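
A sketch of that higher-level approach, using loop.create_connection()
(the name create_transport() eventually took in asyncio). The class
CountingProtocol and the helper connect_twice() are hypothetical, and
the caveat about stale private state still applies to any reused
instance.

```python
import asyncio


class CountingProtocol(asyncio.Protocol):
    """Hypothetical protocol that counts how often it gets connected."""

    def __init__(self):
        self.connections = 0
        self.transport = None

    def connection_made(self, transport):
        # Any per-connection state would have to be reset here by hand.
        self.connections += 1
        self.transport = transport


async def connect_twice(host, port):
    loop = asyncio.get_running_loop()
    protocol = CountingProtocol()
    for _attempt in range(2):
        # The "factory" simply hands back the same pre-built instance.
        transport, _ = await loop.create_connection(lambda: protocol, host, port)
        transport.close()
    return protocol.connections
```

Errors from each attempt surface at the await, which sidesteps the
question of where a reconnect() method would report them.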

>> Twisted suggested something here which I haven't implemented yet but
>> which seems reasonable -- using a series of short timeouts try
>> connecting to the various addresses and keep the first one that
>> connects successfully. If multiple addresses connect after the first
>> timeout, too bad, just close the redundant sockets, little harm is
>> done (though the timeouts should be tuned so that this is relatively
>> rare, because a server may waste significant resources on such
>> redundant connects).
>
> Fast, yes - reasonable? - no.
> How would you feel if web browsers behaved like this?

I have no idea -- who says they aren't doing this? Browsers do tons of
stuff that I am not aware of.

> domain name has to be resolved, addresses ordered according to rfc X
> which says prefer ipv6 etc., try connecting linearly.

Sure. It was just an idea. I'll see what Twisted actually does.

>>> For almost all the timeouts I mentioned - the protocol needs to take
>>> care - so the protocol has to exist before the connection is
>>> established in case of outbound connections.
>>
>> I'm not sure I follow. Can you sketch out some code to help me here?
>> ISTM that e.g. the DNS, connect and handshake timeouts can be
>> implemented by the machinery that tries to set up the connection
>> behind the scenes, and the user's protocol won't know anything of
>> these shenanigans. The code that calls create_transport() (actually
>> it'll probably be renamed create_client()) will just get a Future that
>> either indicates success (and then the protocol and transport are
>> successfully hooked up) or an error (and then no protocol was created
>> -- whether or not a transport was created is an implementation
>> detail).
>
> From my understanding the Future does not provide any information
> which connection to which host using which protocol and credentials
> failed?

That's not up to the Future -- it just passes an exception object
along. We could make this info available as attributes on the
exception object, if there is a need.

> I'd create the Protocol when trying to create a connection, so the
> Protocol is informed when the Transport fails and can take action -
> retry, whatever.

I had this in an earlier version, but Glyph convinced me that this is
the wrong design -- and it doesn't work for servers anyway, you must
have a protocol factory there.

>>>  * data_received(data) - if it was possible to return the number of
>>> bytes consumed by the protocol, and have the Transport buffer the rest
>>> for the next io in call, one would avoid having to do this in every
>>> Protocol on its own - learned from experience.
>>
>> Twisted has a whole slew of protocol implementation subclasses that
>> implement various strategies like line-buffering (including a really
>> complex version where you can turn the line buffering on and off) and
>> "netstrings". I am trying to limit the PEP's size by not including
>> these, but I fully expect that in practice a set of useful protocol
>> implementations will be created that handles common cases. I'm not
>> convinced that putting this in the transport/protocol interface will
>> make user code less buggy: it seems easy for the user code to miscount
>> the bytes or not return a count at all in a rarely taken code branch.
>
> Please don't drop this.
>
> You never know how much data you'll receive, you never know how much
> data you need for a message, so the Protocol needs a buffer.

That all depends on what the protocol is trying to do. (The ECHO
protocol certainly doesn't need a buffer. :-)

> Having this io in buffer in the Transports allows every Protocol to
> benefit, they try to read a message from the data passed to
> data_received(), if the data received is not sufficient to create a
> full message, they need to buffer it and wait for more data.

Having it in a Protocol base class also allows every protocol that
wants it to benefit, without complicating the transport. I can also
see problems where the transport needs to keep calling data_received()
until either all data is consumed or it returns 0 (no data consumed).
It just doesn't seem right to make the transport responsible for this
logic, since it doesn't know enough about the needs of the protocol.

> So having the Protocol.data_received return the number of bytes the
> Protocol could process, the Transport can do the job, saving it for
> every Protocol.
> Still - a protocol can have its own buffering strategy, i.e. in case
> of an incremental XML parser which does its own buffering, and always
> return len(data), so the Transport does not buffer anything.

Right, data_received() is closely related to the concept of a "feed
parser" which is used in a few places in the stdlib
(http://docs.python.org/3/search.html?q=feed&check_keywords=yes&area=default)
and even has a 3rd party implementation
(http://pypi.python.org/pypi/feedparser/), and there the parser (i.e.
the protocol equivalent) is always responsible for buffering data it
cannot immediately process.
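
To illustrate buffering in a protocol base class rather than in the
transport, here is a minimal feed-parser-style sketch. The class names
LineBufferingProtocol and Collector, and the newline delimiter, are
assumptions chosen for the example.

```python
class LineBufferingProtocol:
    """Hypothetical base class: buffers bytes, delivers complete lines."""

    delimiter = b"\n"

    def __init__(self):
        self._buffer = b""

    def data_received(self, data):
        # Keep whatever cannot be processed yet, as a feed parser does.
        self._buffer += data
        *lines, self._buffer = self._buffer.split(self.delimiter)
        for line in lines:
            self.line_received(line)

    def line_received(self, line):
        raise NotImplementedError


class Collector(LineBufferingProtocol):
    """Example subclass that just records each complete line."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def line_received(self, line):
        self.lines.append(line)
```

Feeding b"hel" and then b"lo\nwor" to a Collector delivers one line,
b"hello", and leaves b"wor" buffered; the transport never needs to know.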

>>> Next, what happens if a dns can not be resolved, ssl handshake (in
>>> case ssl is transport) or connecting fails - in my opinion it's an
>>> error the protocol is supposed to take care of
>>>  + error_dns
>>>  + error_ssl
>>>  + error_connecting
>>
>> The future returned by create_transport() (aka create_client()) will
>> raise the exception.
>
> When do I get this exception - the EventLoop.run() raises?

No, the eventloop doesn't normally raise, just whichever task is
waiting for that future using 'yield from' will get the exception. Or
you can use eventloop.run_until_complete(<future>) and then that call
will raise.
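
Both paths can be sketched with today's asyncio, where 'await' is the
modern spelling of 'yield from' on a future. The functions waiter() and
demo() and the choice of ConnectionRefusedError are illustrative
assumptions.

```python
import asyncio


async def waiter(future):
    # The exception surfaces here, in the task waiting on the future.
    try:
        await future
    except ConnectionRefusedError as exc:
        return "task saw: %s" % exc


def demo():
    loop = asyncio.new_event_loop()
    try:
        failed = loop.create_future()
        failed.set_exception(ConnectionRefusedError("connect failed"))
        # 1. The waiting task receives the exception at its await point.
        seen = loop.run_until_complete(waiter(failed))
        # 2. Running the future directly makes run_until_complete() raise.
        try:
            loop.run_until_complete(failed)
        except ConnectionRefusedError:
            raised = True
        else:
            raised = False
        return seen, raised
    finally:
        loop.close()
```

In neither case does the event loop itself fail; only the code that asked
for the result sees the exception.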

-- 
--Guido van Rossum (python.org/~guido)