Re: [Python-ideas] PEP 3156 - Asynchronous IO Support Rebooted
On Fri, Jan 4, 2013 at 6:53 PM, Markus
On Fri, Jan 4, 2013 at 11:33 PM, Guido van Rossum
wrote: On Wed, Dec 26, 2012 at 2:38 PM, Markus
wrote: First shot should be getting a well-established event loop into Python.
Perhaps. What is your definition of an event loop?
I ask the loop to notify me via callback if something I care about happens.
Heh. That's rather too general -- it depends on "something I care about" which could be impossible to guess. :-)
Usually that's fds and read/writeability.
Ok, although on some platforms it can't be a fd (UNIX-style small integer) but some other abstraction, e.g. a socket *object* in Jython or a "handle" on Windows (but I am already starting to repeat myself :-).
I create a data structure which has the fd, the event I care about, the callback and userdata, pass it to the loop, and the loop will take care.
Next, timers, same story, I create a data structure which has the time I care about, the callback and userdata, pass it to the loop, and the loop will take care.
The "create data structure" part is a specific choice of interface style, not necessarily the best for Python. Most event loop implementations I've seen for Python (pyev excluded) just have various methods that express everything through the argument list, not with a separate data structure.
Signals - sometimes having signals in the event loop is handy too. Same story.
Agreed, I've added this to the open issues section in the PEP. Do you have a suggestion for a minimal interface for signal handling? I could imagine the following:

- add_signal_handler(sig, callback, *args). Whenever signal 'sig' is received, arrange for callback(*args) to be called. Returns a Handler which can be used to cancel the signal callback. Specifying another callback for the same signal replaces the previous handler (only one handler can be active per signal).
- remove_signal_handler(sig). Removes the handler for signal 'sig', if one is set.

Is anything else needed? Note that Python only receives signals in the main thread, and the effect may be undefined if the event loop is not running in the main thread, or if more than one event loop sets a handler for the same signal. It also can't work for signals directed to a specific thread (I think POSIX defines a few of these, but I don't know of any support for these in Python.)
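The proposed pair of methods could be sketched roughly as follows. This is only an illustration of the semantics described above (one handler per signal, replacement on re-registration, cancellation via the returned Handler); the `Handler` and `SignalLoopMixin` class shapes are assumptions, not the PEP's actual classes, and it assumes the loop runs in the main thread (which `signal.signal()` requires anyway):

```python
import signal

class Handler:
    """Sketch of the Handler object the PEP returns; hypothetical shape."""
    def __init__(self, callback, args):
        self.callback = callback
        self.args = args
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

class SignalLoopMixin:
    """Minimal sketch of the proposed add/remove_signal_handler interface."""
    def __init__(self):
        self._signal_handlers = {}

    def add_signal_handler(self, sig, callback, *args):
        handler = Handler(callback, args)
        self._signal_handlers[sig] = handler  # replaces any previous handler
        def _dispatch(signum, frame):
            h = self._signal_handlers.get(signum)
            if h is not None and not h.cancelled:
                h.callback(*h.args)
        signal.signal(sig, _dispatch)
        return handler

    def remove_signal_handler(self, sig):
        if self._signal_handlers.pop(sig, None) is not None:
            signal.signal(sig, signal.SIG_DFL)
```

A real implementation would dispatch through the self-pipe rather than calling the callback synchronously in the C-level signal handler context.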
But sockets are not native on Windows, and I am making some effort with PEP 3156 to efficiently support higher-level abstractions without tying them to sockets. (The plan is to support IOCP on Windows. The previous version of Tulip already had a branch that did support that, as a demonstration of the power of this abstraction.)
Supporting IOCP on Windows is absolutely required, as WSAPoll is broken and won't be fixed. http://social.msdn.microsoft.com/Forums/hu/wsk/thread/18769abd-fca0-4d3c-988...
Wow. Now I'm even more glad that we're planning to support IOCP.
Only if the C code also uses libev, of course. But C programs may use other event mechanisms -- e.g. AFAIK there are alternatives to libev (during the early stages of Tulip development I chatted a bit with one of the original authors of libevent, Niels Provos, and I believe there's also something called libuv), and GUI frameworks (e.g. X, Qt, Gtk, Wx) tend to have their own event loop.
libuv is a wrapper around libev (adding IOCP) which adds some other things besides an event loop, and is developed for/used in node.js.
Ah, that's helpful. I did not realize this after briefly skimming the libuv page. (And the github logs suggest that it may no longer be the case: https://github.com/joyent/libuv/commit/1282d64868b9c560c074b9c9630391f3b18ef...
PEP 3156 is designed to let alternative *implementations* of the same *interface* be selected at run time. Hopefully it is possible to provide a conforming implementation using libev -- then your goal (smooth interoperability with C code using libev) is obtained.
Smooth interoperability is not a major goal here - it's great if you get it for free. I'm just looking forward an event loop in the stdlib I want to use.
Heh, so stop objecting. :-)
(It would also be harder to implement initially as a 3rd party framework. At the lowest level, no changes to Python itself are needed -- it already supports non-blocking sockets, for example. But adding optional callbacks to existing low-level APIs would require changes throughout the stdlib.)
As a result, making the stdlib async-IO aware - the complete stdlib - would be great.
No matter what API style is chosen, making the entire stdlib async aware will be tough. No matter what you do, the async support will have to be "pulled through" every abstraction layer -- e.g. making sockets async-aware doesn't automatically make socketserver or urllib2 async-aware(*). With the strong requirements for backwards compatibility, in many cases it may be easier to define a new API that is suitable for async use instead of trying to augment existing APIs.

(*) Unless you use microthreads, like gevent, but this has its own set of problems -- I don't want to get into that here, since we seem to at least agree on the need for an event loop with callbacks.
I am not so concerned about naming (it seems inevitable that everyone uses somewhat different terminology anyway, and it is probably better not to reuse terms when the meaning is different), but I do like to look at guarantees (or the absence thereof!) and best practices for dealing with the differences between platforms.
Handler - the best example for not re-using terms.
??? (Can't tell if you're sarcastic or agreeing here.)
You haven't convinced me about this.
Fine, if you include transports, I'll pick on the transports as well ;)
??? (Similar.)
However, you can help me by comparing the event loop part of PEP 3156 (ignoring anything that returns or takes a Future) to libev and pointing out things (either specific APIs or certain guarantees or requirements) that would be hard to implement using libev, as well as useful features in libev that you think every event loop should have.
Note: In libev only the "default event loop" can have timers.
Interesting. This seems an odd constraint.
EventLoop
* run() - ev_run(struct ev_loop)
* stop() - ev_break(EV_UNLOOP_ALL)
* run_forever() - registering an idle watcher will keep the loop alive
* run_once(timeout=None) - register a timer, have the timer stop() the loop
* call_later(delay, callback, *args) - ev_timer
* call_repeatedly(interval, callback, *args) - ev_timer (periodic)
* call_soon(callback, *args) - equivalent to call_later(0, callback, *args)
- call_soon_threadsafe(callback, *args) - it would be better to have the event loop take care of signals too; otherwise, waking up an ev_async in the loop, which checks an async queue containing the required information to register the call_soon callback, would be possible
Not sure I understand. PEP 3156/Tulip uses a self-pipe to prevent race conditions when call_soon_threadsafe() is called from a signal handler or other thread(*) -- but I don't know if that is relevant or not. (*) http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#448 and http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#576
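The self-pipe trick mentioned here can be sketched in a few lines. This is an illustration of the mechanism, not Tulip's actual code; the `SelfPipeLoop` name and its `run_once()` method are made up for the example (a socketpair is used instead of a pipe so the same sketch works on Windows):

```python
import select
import socket
import threading

class SelfPipeLoop:
    """Sketch of the self-pipe trick: call_soon_threadsafe() appends the
    callback to a queue and writes a byte to a socketpair, so a loop that
    is blocked in select() wakes up and runs the queued callbacks."""
    def __init__(self):
        self._rsock, self._wsock = socket.socketpair()
        self._rsock.setblocking(False)
        self._lock = threading.Lock()
        self._pending = []

    def call_soon_threadsafe(self, callback, *args):
        with self._lock:
            self._pending.append((callback, args))
        self._wsock.send(b'\0')  # wake up the select() below

    def run_once(self):
        select.select([self._rsock], [], [])  # blocks until woken
        try:
            self._rsock.recv(4096)  # drain the wakeup bytes
        except BlockingIOError:
            pass
        with self._lock:
            pending, self._pending = self._pending, []
        for callback, args in pending:
            callback(*args)
```

The same wakeup path serves both signal handlers and other threads, which is why a single self-pipe suffices.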
- getaddrinfo(host, port, family=0, type=0, proto=0, flags=0) - libev does not do dns
- getnameinfo(sockaddr, flags=0) - libev does not do dns
Note that these exist at least in part so that an event loop implementation may *choose* to implement its own DNS handling (IIUC Twisted has this), whereas the default behavior is just to run socket.getaddrinfo() -- but in a separate thread because it blocks. (This is a useful test case for run_in_executor() too.)
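The default behaviour described here - running blocking name resolution in a separate thread - can be sketched like this. The `getaddrinfo_async` helper and its executor are assumptions for illustration; a real loop would wrap the Future so it can be waited on with "yield from":

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# socket.getaddrinfo() blocks, so the default event loop can satisfy the
# getaddrinfo() API by handing the call to an executor thread.
_executor = ThreadPoolExecutor(max_workers=2)

def getaddrinfo_async(host, port, family=0, type=0, proto=0, flags=0):
    """Return a Future for the blocking socket.getaddrinfo() call."""
    return _executor.submit(
        socket.getaddrinfo, host, port, family, type, proto, flags)
```

An event loop implementation that does its own DNS (as Twisted can) would provide the same signature but never touch a thread.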
- create_transport(protocol_factory, host, port, **kwargs) - libev does not do transports
- start_serving(protocol_factory, host, port, **kwds) - libev does not do transports
* add_reader(fd, callback, *args) - create an ev_io watcher with EV_READ
* add_writer(fd, callback, *args) - create an ev_io watcher with EV_WRITE
* remove_reader(fd) - in libev you have to name the watcher you want to stop; you can not remove watchers/handlers by fd. A workaround is maintaining a dict with fd:Handler in the EventLoop.
Ok, this does not sound like a show-stopper for a conforming PEP 3156 implementation on top of libev then, right? Just a minor inconvenience. I'm sure everyone has *some* impedance mismatches to deal with.
* remove_writer(fd) - same
* add_connector(fd, callback, *args) - poll for writeability, getsockopt, done
TBH, I'm not 100% convinced of the need for add_connector(), but Richard Oudkerk claims that it is needed for Windows. (OTOH if WSAPoll() is too broken to bother, maybe we don't need it. It's a bit of a nuisance because code that uses add_writer() instead works just fine on UNIX but would be subtly broken on Windows, leading to disappointments when porting apps to Windows. I'd rather have things break on all platforms, or on none...)
* remove_connector(fd) - same as with all other remove-by-fd methods
As Transport are part of the PEP - some more:
EventLoop
* create_transport(protocol_factory, host, port, **kwargs)
  kwargs requires "local" - a local address tuple like ('fe80::14ad:1680:54e1:6a91%eth0', 0), so you can bind when using IPv6 link-local scope, or ('192.168.2.1', 5060) to bind a local port for UDP.
Not sure I understand. What socket.connect() (or other API) call parameters does this correspond to? What can't be expressed through the host and port parameters?
* start_serving(protocol_factory, host, port, **kwds) what is the behaviour for SOCK_DGRAM - does this multiplex sessions based on src host/port / dst host/port - I'd love it.
TBH I haven't thought much about datagram transports. It's been years since I used UDP. I guess the API may have to distinguish between connected and unconnected UDP. I think the transport/protocol API will be different than for SOCK_STREAM: for every received datagram, the transport will call protocol.datagram_received(data, address) (the address will be a dummy for connected use), and to send a datagram, the protocol must call transport.write_datagram(data, [address]), which returns immediately. Flow control (if supported) should work the same as for streams: if the transport finds its buffers exceed a certain limit it will tell the protocol to back off by calling protocol.pause().
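The datagram interface sketched in the previous paragraph could look like this. The class names and the `poll_once()` stand-in for the event loop are made up for illustration; only `datagram_received()` and `write_datagram()` come from the text above:

```python
import socket

class RecordingDatagramProtocol:
    """Sketch of the datagram protocol side: one callback per datagram."""
    def __init__(self):
        self.received = []

    def datagram_received(self, data, address):
        self.received.append((data, address))

class DatagramTransport:
    """Sketch of an unconnected-UDP transport; write_datagram() returns
    immediately, as described above."""
    def __init__(self, protocol, local_addr):
        self.protocol = protocol
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind(local_addr)

    @property
    def address(self):
        return self._sock.getsockname()

    def write_datagram(self, data, address=None):
        self._sock.sendto(data, address)

    def poll_once(self, timeout=5.0):
        # Stand-in for the event loop: receive one datagram and dispatch it.
        self._sock.settimeout(timeout)
        data, address = self._sock.recvfrom(65536)
        self.protocol.datagram_received(data, address)
```

For connected UDP the transport would call connect() on the socket and pass a fixed dummy address to datagram_received().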
Handler: Requiring 2 handlers for every active connection r/w is highly ineffective.
How so? What is the concern? The actions of the read and write handler are typically completely different, so the first thing the handler would have to do is to decide whether to call the read or the write code. Also, depending on flow control, only one of the two may be active. If you are after minimizing the number of records passed to [e]poll or kqueue, you can always collapse the handlers at that level and distinguish between read/write based on the mask and recover the appropriate user-level handler from the readers/writers array (and this is what Tulip's epoll pollster class does).

PS. Also check out this issue, where an implementation of *just* Tulip's pollster class for the stdlib is being designed: http://bugs.python.org/issue16853; also check out the code reviews here: http://bugs.python.org/review/16853/
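The "collapse at the pollster level" idea can be sketched as follows. This is not Tulip's epoll pollster itself, just a minimal illustration of the same idea (one registration per fd, with the user-level read/write handlers recovered from the event mask), written with the modern selectors module for brevity:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def add_reader(fileobj, callback):
    """Attach a read handler; fold into an existing registration if any."""
    try:
        key = sel.get_key(fileobj)
    except KeyError:
        sel.register(fileobj, selectors.EVENT_READ, (callback, None))
    else:
        sel.modify(fileobj, key.events | selectors.EVENT_READ,
                   (callback, key.data[1]))

def add_writer(fileobj, callback):
    """Attach a write handler; fold into an existing registration if any."""
    try:
        key = sel.get_key(fileobj)
    except KeyError:
        sel.register(fileobj, selectors.EVENT_WRITE, (None, callback))
    else:
        sel.modify(fileobj, key.events | selectors.EVENT_WRITE,
                   (key.data[0], callback))

def poll_once(timeout=None):
    """One record per fd comes back from the poller; the mask tells us
    which of the two user-level handlers to invoke."""
    for key, mask in sel.select(timeout):
        reader, writer = key.data
        if mask & selectors.EVENT_READ and reader is not None:
            reader()
        if mask & selectors.EVENT_WRITE and writer is not None:
            writer()
```

So the two-handlers-per-connection API at the user level need not cost two records at the [e]poll level.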
I'd prefer to be able to create a Handler from a loop:

Handler = EventLoop.create_handler(socket, callback, events)

and have the callback called with the returned events, so I can multiplex read/write ops in the callback.
Hm. See above.
Additionally, I can .stop() the handler without having to know the fd, change the events the handler is looking for, and restart the handler with .start(). In your proposal, I'd create a new handler every time I want to send something, poll for writability, discard the handler when I'm done, and create a new one for the next send.
The questions are, does it make any difference in efficiency (when using Python -- the performance of the C API is hardly relevant here), and how often does this pattern occur.
Timers: Not in the PEP - re-arming a timer. Let's say I want to do something if nothing happens for 5 seconds: I create a timer with call_later(5., cb); if something happens, I need to cancel the timer and create a new one. If there was a Timer:

Timer.stop()
Timer.set(5)
Timer.start()
Actually it's one less call using the PEP's proposed API:

timer.cancel()
timer = loop.call_later(5, callback)

Which of the two idioms is faster? Who knows? libev's pattern is probably faster in C, but that has little to bear on the cost in Python. My guess is that the amount of work is about the same -- the real cost is that you have to make some changes to the heap used to keep track of all timers in the order in which they will trigger, and those changes are the same regardless of how you style the API.
Transports: I think SSL should be a Protocol, not a transport - implemented using BIO pairs. If you can chain protocols, like Transport / ProtocolA / ProtocolB, you can have TCP / SSL / HTTP as https, or TCP / SSL / SOCKS / HTTP as https via an ssl-enabled socks proxy, without too many problems. Another example, shaping a connection: TCP / RATELIMIT / HTTP.
Interesting idea. This may be up to the implementation -- not every implementation may have BIO wrappers available (AFAIK the stdlib doesn't), so the stackability may not be easy to implement everywhere. In any case, when you stack things like this, the stack doesn't look like transport<-->protocol<-->protocol<-->protocol; rather, it's A<-->B<-->C<-->D where each object has a "left" and a "right" API. Each arrow connects the "transport (right) half" of the object on its left (e.g. A) to the "protocol (left) half" of the object on the arrow's right (e.g. B). So maybe we can visualise this as T1 <--> P2:T2 <--> P3:T3 <--> P4.
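The T1 <--> P2:T2 <--> P3 picture can be made concrete with a toy middle layer. The classes below are invented for illustration (a real middle layer would be, e.g., the SSL BIO wrapper); the point is only the plumbing: the middle object exposes the protocol methods on its left side and the transport methods on its right side:

```python
class UppercaseLayer:
    """Toy middle object: a protocol toward the transport below it and a
    transport toward the protocol above it. It upcases data in both
    directions just to make the two halves visible."""
    def __init__(self, protocol):
        self.protocol = protocol   # our right-hand neighbour (P3)
        self.transport = None      # our left-hand neighbour (T1)

    # --- protocol half, called by the transport below us ---
    def connection_made(self, transport):
        self.transport = transport
        self.protocol.connection_made(self)

    def data_received(self, data):
        self.protocol.data_received(data.upper())

    # --- transport half, called by the protocol above us ---
    def write(self, data):
        self.transport.write(data.upper())

# Toy endpoints to wire a stack together:
class RecordingTransport:
    def __init__(self):
        self.written = []
    def write(self, data):
        self.written.append(data)

class RecordingProtocol:
    def __init__(self):
        self.received = []
        self.transport = None
    def connection_made(self, transport):
        self.transport = transport
    def data_received(self, data):
        self.received.append(data)
```

Note that nothing here inherits from a base class; duck typing is enough, which is exactly what makes the stacking possible.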
Having SSL as a Protocol allows closing the SSL connection without closing the TCP connection, re-using the TCP connection, re-using a SSL session cookie during reconnect of the SSL Protocol.
That seems a pretty esoteric use case (though given your background in honeypots maybe common for you :-). It also seems hard to get both sides acting correctly when you do this (but I'm certainly no SSL expert -- I just want it supported because half the web is inaccessible these days if you don't speak SSL, regardless of whether you do any actual verification).

All in all I think that stackable transports/protocols are mostly something that is enabled by the interfaces defined here (the PEP takes care not to specify any base classes from which you must inherit -- you must just implement certain methods, and the rest is duck typing) but otherwise does not concern the PEP much.

The only concern I have, really, is that the PEP currently hints that both protocols and transports might have pause() and resume() methods for flow control, where the protocol calls transport.pause() if protocol.data_received() is called too frequently, and the transport calls protocol.pause() if transport.write() has buffered more data than sensible. But for an object that is both a protocol and a transport, this would make it impossible to distinguish between pause() calls by its left and right neighbors. So maybe the names must differ. Given the tendency of transport method names to be shorter (e.g. write()) vs. the longer protocol method names (data_received(), connection_lost() etc.), perhaps it should be transport.pause() and protocol.pause_writing() (and similar for resume()).
* reconnect() - I'd love to be able to reconnect a transport
But what does that mean in general? It depends on the protocol (e.g. FTP, HTTP, IRC, SMTP) how much state must be restored/renegotiated upon a reconnect, and how much data may have to be re-sent. This seems a higher-level feature that transports and protocols will have to implement themselves.
* timers - Transports need timers
I think you mean timeouts?
* dns-resolve-timeout - dns can be slow
* connecting-timeout - connecting can take too much time, more than we want to wait
* idle-timeout (no action on the connection for a while) - call protocol.timeout_idle()
* sustain-timeout (max session time) - close() transport
* ssl-handshake-timeout (in case ssl is a Transport) - close transport
* close-timeout (shutdown is async) - close transport hard
* reconnect-timeout (wait some seconds before reconnecting) - reconnect connection
This is an interesting point. I think some of these really do need APIs in the PEP, others may be implemented using existing machinery (e.g. call_later() to schedule a callback that calls cancel() on a task). I've added a bullet on this to Open Issues.
Now, in case we connect to a host by name, and have multiple addresses resolved, and the first connection can not be established, there is no way to 'reconnect()' - as the protocol does not yet exist.
Twisted suggested something here which I haven't implemented yet but which seems reasonable -- using a series of short timeouts, try connecting to the various addresses and keep the first one that connects successfully. If multiple addresses connect after the first timeout, too bad, just close the redundant sockets, little harm is done (though the timeouts should be tuned so that this is relatively rare, because a server may waste significant resources on such redundant connects).
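The staggered-attempt strategy described here can be sketched with ordinary Futures. This is only an illustration of the scheduling logic; the `staggered_connect` name is invented, `connectors` stand in for per-address connection attempts, and a real implementation would use non-blocking sockets on the event loop (and close the redundant sockets) rather than threads:

```python
import concurrent.futures

def staggered_connect(connectors, stagger=0.3):
    """Start the next connection attempt every `stagger` seconds and keep
    the first one that succeeds. `connectors` are callables that block
    until connected, or raise on failure. (A finished failed attempt ends
    the wait early, which simply moves on to the next address sooner.)"""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(connectors))
    futures = []
    try:
        for connect in connectors:
            futures.append(pool.submit(connect))
            done, _ = concurrent.futures.wait(
                futures, timeout=stagger,
                return_when=concurrent.futures.FIRST_COMPLETED)
            for f in done:
                if f.exception() is None:
                    return f.result()  # first success wins
        # All attempts started; wait for the first one to succeed.
        for f in concurrent.futures.as_completed(futures):
            if f.exception() is None:
                return f.result()
        raise ConnectionError('all connection attempts failed')
    finally:
        pool.shutdown(wait=False)
```

The tuning question Guido mentions is the `stagger` value: short enough to mask a dead first address, long enough that redundant connects stay rare.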
For almost all the timeouts I mentioned - the protocol needs to take care - so the protocol has to exist before the connection is established in case of outbound connections.
I'm not sure I follow. Can you sketch out some code to help me here? ISTM that e.g. the DNS, connect and handshake timeouts can be implemented by the machinery that tries to set up the connection behind the scenes, and the user's protocol won't know anything of these shenanigans. The code that calls create_transport() (actually it'll probably be renamed create_client()) will just get a Future that either indicates success (and then the protocol and transport are successfully hooked up) or an error (and then no protocol was created -- whether or not a transport was created is an implementation detail).
In case a connection is lost and reconnecting is required, .reconnect() is handy, so the protocol can request reconnecting.
I'd need more details of how you would like to specify this.
As this does not work with the current Protocol callbacks, I therefore propose Protocol.connection_established().
How does this differ from connection_made()? (I'm trying to follow Twisted's guidance here, they seem to have the longest experience doing these kinds of things. When I talked to Glyph IIRC he was skeptical about reconnecting in general.)
Protocols: I'd outline that protocol_factory can be an instance of a class, which can set specific parameters for 'things':

class p:
    def __init__(self, a=1, b=2, c=3):
        self.a = a
        self.b = b
        self.c = c

    def __call__(self):
        return p(a=self.a, b=self.b, c=self.c)

    # ... all protocol methods ...
EventLoop.start_serving(p(a=5,b=7), ...) EventLoop.start_serving(p(a=9,b=4), ...)
Same Protocol, different parameters for it.
No such helper method (or class) is needed. You can use a lambda or functools.partial for the same effect. I'll add a note to the PEP to remind people of this.
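The lambda/functools.partial equivalent of the __call__-able instance above can be shown in a couple of lines. The `ParamProtocol` class here is a stand-in for the class 'p' in the example, not anything from the PEP:

```python
import functools

class ParamProtocol:
    """Hypothetical protocol whose behaviour depends on constructor args."""
    def __init__(self, a=1, b=2, c=3):
        self.a, self.b, self.c = a, b, c

# Instead of passing a __call__-able instance as protocol_factory, pass a
# partial or a lambda; each call produces a fresh, pre-configured protocol:
factory_one = functools.partial(ParamProtocol, a=5, b=7)
factory_two = lambda: ParamProtocol(a=9, b=4)
```

Either factory can then be handed to start_serving(): same Protocol class, different parameters, no helper class needed.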
+ connection_established() + timeout_dns() + timeout_idle() + timeout_connecting()
Signatures please?
* data_received(data) - if it was possible to return the number of bytes consumed by the protocol, and have the Transport buffer the rest for the next call, one would avoid having to do this in every Protocol on its own - learned from experience.
Twisted has a whole slew of protocol implementation subclasses that implement various strategies like line-buffering (including a really complex version where you can turn the line buffering on and off) and "netstrings". I am trying to limit the PEP's size by not including these, but I fully expect that in practice a set of useful protocol implementations will be created that handles common cases. I'm not convinced that putting this in the transport/protocol interface will make user code less buggy: it seems easy for the user code to miscount the bytes or not return a count at all in a rarely taken code branch.
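A line-buffering helper of the kind Twisted provides can be sketched in a dozen lines, which shows why the consumed-byte-count return value isn't needed at the transport/protocol boundary: the protocol (or a reusable helper base) keeps the unconsumed tail itself. The `LineReceiver` name echoes Twisted's, but this is a simplified sketch, not Twisted's implementation:

```python
class LineReceiver:
    """Sketch of a line-buffering protocol helper: data_received() keeps
    the unconsumed tail in its own buffer, so the transport never needs a
    consumed-byte count handed back."""
    def __init__(self):
        self._buffer = b''
        self.lines = []

    def data_received(self, data):
        self._buffer += data
        while b'\n' in self._buffer:
            line, self._buffer = self._buffer.split(b'\n', 1)
            self.line_received(line)

    def line_received(self, line):
        self.lines.append(line)  # a real protocol would override this
```

A set of such helpers (lines, netstrings, length-prefixed frames) can live outside the PEP as ordinary protocol base classes.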
* eof_received()/connection_lost(exc) - a connection can be closed cleanly (recv() = 0) or uncleanly (recv() = -1 with errno, SIGPIPE when writing), and in the case of SSL there are even more possibilities; it is required to distinguish these.
Well, this is why eof_received() exists -- to indicate a clean close. We should never receive SIGPIPE (Python disables this signal, so you always get the errno instead). According to Glyph, SSL doesn't support sending eof, so you have to use Content-length or a chunked encoding. What other conditions do you expect from SSL that wouldn't be distinguished by the exception instance passed to connection_lost()?
+ nextlayer_is_empty() - called if the Transport (or underlying Protocol in case of chaining) write buffer is empty. Imagine an http server sending a 1GB file: you do not want to send 1GB at once, as you do not have that much memory, but you do want a callback once the transport is done sending the chunk you've queued, so you can send the next chunk of data.
That's what the pause()/resume() flow control protocol is for. You read the file (presumably it's a file) in e.g. 16K blocks and call write() for each block; if the transport can't keep up and exceeds its buffer space, it calls protocol.pause() (or perhaps protocol.pause_writing(), see discussion above).
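The 16K-blocks-with-backpressure pattern can be sketched as follows. The class names are invented for the example, and the `ToyTransport` is a stand-in for a real transport's buffer management; the interesting part is the pause()/resume() handshake described above:

```python
class ChunkedSender:
    """Sketch of the flow-control pattern: write a large payload in 16K
    chunks, stop when the transport calls pause(), continue on resume()."""
    CHUNK = 16 * 1024

    def __init__(self, transport, data):
        self.transport = transport
        self.data = data
        self.offset = 0
        self.paused = False

    def pause(self):    # called by the transport when its buffer is full
        self.paused = True

    def resume(self):   # called by the transport when it has drained
        self.paused = False
        self.send_some()

    def send_some(self):
        while not self.paused and self.offset < len(self.data):
            chunk = self.data[self.offset:self.offset + self.CHUNK]
            self.transport.write(chunk)
            self.offset += len(chunk)

class ToyTransport:
    """Fake transport that pauses its protocol past a 48K buffer."""
    def __init__(self):
        self.buffer = b''
        self.protocol = None

    def write(self, data):
        self.buffer += data
        if len(self.buffer) > 48 * 1024:
            self.protocol.pause()

    def drain(self):
        self.buffer = b''
        self.protocol.resume()
```

At no point does more than one buffer's worth of the file sit in memory, which is the whole point of the callback-driven send.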
Next, what happens if a dns name can not be resolved, the ssl handshake fails (in case ssl is a transport), or connecting fails? In my opinion it's an error the protocol is supposed to take care of:

+ error_dns
+ error_ssl
+ error_connecting
The future returned by create_transport() (aka create_client()) will raise the exception.
I'm not that much into futures - so I may have got some things wrong.
No problem. You may want to read PEP 3148, it explains Futures and much of that explanation remains valid; just in PEP 3156 to wait for a future you must use "yield from <future>". -- --Guido van Rossum (python.org/~guido)
On 05/01/2013 11:30pm, Guido van Rossum wrote:
Supporting IOCP on windows is absolutely required, as WSAPoll is broken and won't be fixed. http://social.msdn.microsoft.com/Forums/hu/wsk/thread/18769abd-fca0-4d3c-988... Wow. Now I'm even more glad that we're planning to support IOCP.
I took care to work around that bug when adding support for WSAPoll() in tulip. -- Richard
On 05/01/2013 11:30pm, Guido van Rossum wrote:
TBH, I'm not 100% convinced of the need for add_connector(), but Richard Oudkerk claims that it is needed for Windows. (OTOH if WSAPoll() is too broken to bother, maybe we don't need it. It's a bit of a nuisance because code that uses add_writer() instead works just fine on UNIX but would be subtly broken on Windows, leading to disappointments when porting apps to Windows. I'd rather have things break on all platforms, or on none...)
add_connector() is needed to work around the brokenness of WSAPoll(). -- Richard
Hi,
Do you have a suggestion for a minimal interface for signal handling? I could imagine the following:
Note that Python only receives signals in the main thread, and the effect may be undefined if the event loop is not running in the main thread, or if more than one event loop sets a handler for the same signal. It also can't work for signals directed to a specific thread (I think POSIX defines a few of these, but I don't know of any support for these in Python.)
Exactly - signals are a mess, and threading and signals make things worse. I'm no expert here, but I have experienced problems with signal handling and threads, basically the same problems you describe. Creating the threads after installing signal handlers (in the main thread) works, and signals get delivered to the main thread; installing the signal handlers (in the main thread) after creating the threads, the signals ended up in *some thread*. Additionally, it depended on whether you installed your signal handler with signal() or sigaction(), and on the flags used when creating threads.
Supporting IOCP on windows is absolutely required, as WSAPoll is broken and won't be fixed. http://social.msdn.microsoft.com/Forums/hu/wsk/thread/18769abd-fca0-4d3c-988...
Wow. Now I'm even more glad that we're planning to support IOCP.
tulip already has a workaround: http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#244
libuv is a wrapper around libev (adding IOCP) which adds some other things besides an event loop, and is developed for/used in node.js.
Ah, that's helpful. I did not realize this after briefly skimming the libuv page. (And the github logs suggest that it may no longer be the case: https://github.com/joyent/libuv/commit/1282d64868b9c560c074b9c9630391f3b18ef...
Okay, they moved to libngx (the nginx core library); obviously I missed this.
Handler - the best example for not re-using terms.
??? (Can't tell if you're sarcastic or agreeing here.)
sarcastic.
Fine, if you include transports, I'll pick on the transports as well ;)
??? (Similar.)
Not sarcastic.
Note: In libev only the "default event loop" can have timers.
Interesting. This seems an odd constraint.
I'm wrong - discard. This limitation referred to watchers for child processes.
EventLoop
- call_soon_threadsafe(callback, *args) - it would be better to have the event loop take care of signals too

Not sure I understand. PEP 3156/Tulip uses a self-pipe to prevent race conditions when call_soon_threadsafe() is called from a signal handler or other thread(*) -- but I don't know if that is relevant or not.
ev_async is a self-pipe too.
(*) http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#448 and http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#576
- getaddrinfo(host, port, family=0, type=0, proto=0, flags=0) - libev does not do dns
- getnameinfo(sockaddr, flags=0) - libev does not do dns
Note that these exist at least in part so that an event loop implementation may *choose* to implement its own DNS handling (IIUC Twisted has this), whereas the default behavior is just to run socket.getaddrinfo() -- but in a separate thread because it blocks. (This is a useful test case for run_in_executor() too.)
I'd expect the EventLoop never to create threads on its own behalf; it's just wrong. If you can't provide some functionality without threads, don't provide the functionality. Besides, getaddrinfo() is a bad choice, as it relies on distribution-specific flags. For example, the ipv6 link-local scope exists on every current platform, but when resolving a link-local scope address (not a domain) with getaddrinfo, getaddrinfo will fail if no globally routed ipv6 address is available on debian/ubuntu.
As Transport are part of the PEP - some more:
EventLoop
* create_transport(protocol_factory, host, port, **kwargs)
  kwargs requires "local" - a local address tuple like ('fe80::14ad:1680:54e1:6a91%eth0', 0), so you can bind when using IPv6 link-local scope, or ('192.168.2.1', 5060) to bind a local port for UDP.
Not sure I understand. What socket.connect() (or other API) call parameters does this correspond to? What can't be expressed through the host and port parameters?
In case you have multiple interfaces and multiple gateways, you need to assign the connection to an address, so the kernel knows which interface to use for the connection - else it would default to "the first" interface. In IPv6 link-local scope you can have multiple addresses in the same subnet fe80:: - IIRC if you want to connect somewhere, you have to either set the scope_id of the remote, or bind the "source" address first. I don't know how to set the scope_id in Python; it's in sockaddr_in6. In terms of sockets, it is a bind before a connect:

s = socket.socket(AF_INET6, SOCK_DGRAM, 0)
s.bind(('fe80::1', 0))
s.connect(('fe80::2', 4712))

Same for ipv4 in case you are multi-homed and rely on source-based routing.
Handler: Requiring 2 handlers for every active connection r/w is highly ineffective.
How so? What is the concern?
Of course you can fold the fdsets, but in case you need a separate handler for write, you re-create it for every write - see below.
Additionally, I can .stop() the handler without having to know the fd, change the events the handler is looking for, and restart the handler with .start(). In your proposal, I'd create a new handler every time I want to send something, poll for writability, discard the handler when I'm done, and create a new one for the next send.
The questions are, does it make any difference in efficiency (when using Python -- the performance of the C API is hardly relevant here), and how often does this pattern occur.
Every time you send, you poll for writability; you get the callback, you write; once you have nothing left, you stop polling for writability.
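The register-write-drain-unregister cycle described here can be sketched concretely. The `BufferedWriter` class and `poll_once()` helper are invented for the example (using the modern selectors module for brevity); the point is that the write watcher exists only while the buffer is non-empty:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

class BufferedWriter:
    """Sketch of the send pattern: keep a write watcher registered only
    while there is buffered data, unregister as soon as it drains."""
    def __init__(self, sock):
        self.sock = sock
        self.buffer = b''
        self.registered = False

    def write(self, data):
        self.buffer += data
        if not self.registered:   # start polling for writability
            sel.register(self.sock, selectors.EVENT_WRITE, self._flush)
            self.registered = True

    def _flush(self):
        sent = self.sock.send(self.buffer)
        self.buffer = self.buffer[sent:]
        if not self.buffer:       # nothing left: stop polling
            sel.unregister(self.sock)
            self.registered = False

def poll_once(timeout=None):
    for key, mask in sel.select(timeout):
        key.data()
```

Whether the register/unregister churn matters in practice is exactly the efficiency question Guido raises above.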
Timers: ... Timer.stop() Timer.set(5) Timer.start()
Actually it's one less call using the PEP's proposed API:
timer.cancel() timer = loop.call_later(5, callback)
My example was ill-chosen; the problem exists for both of us - how do we know it's 5 seconds? With timer.restart() or timer.again() the timer could remember its interval; otherwise you have to store the interval somewhere, next to the timer.
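The interval-remembering convenience could be layered on top of the PEP's API without changing it. The `Timer.restart()` method and the `ToyLoop` below are hypothetical, not part of the PEP; only call_later() and cancel() come from the proposal:

```python
import heapq
import time

class Timer:
    """Hypothetical Timer that remembers its interval, as suggested above;
    restart() is not part of the PEP."""
    def __init__(self, loop, delay, callback, args):
        self._loop = loop
        self.delay = delay
        self.callback = callback
        self.args = args
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def restart(self):
        # Cancel and re-arm with the remembered interval.
        self.cancel()
        return self._loop.call_later(self.delay, self.callback, *self.args)

class ToyLoop:
    """Just enough of an event loop to schedule timers on a heap."""
    def __init__(self):
        self._heap = []

    def call_later(self, delay, callback, *args):
        timer = Timer(self, delay, callback, args)
        heapq.heappush(self._heap,
                       (time.monotonic() + delay, id(timer), timer))
        return timer
```

As Guido notes, the heap manipulation dominates either way; this only changes where the interval is stored.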
Which of the two idioms is faster? Who knows? libev's pattern is probably faster in C, but that has little to bear on the cost in Python. My guess is that the amount of work is about the same -- the real cost is that you have to make some changes to the heap used to keep track of all timers in the order in which they will trigger, and those changes are the same regardless of how you style the API.
Speed: nothing is fast in every circumstance; for example, select is faster than epoll for small numbers of sockets. Let's look at usability.
Transports: I think SSL should be a Protocol, not a transport - implemented using BIO pairs. If you can chain protocols, like Transport / ProtocolA / ProtocolB, you can have TCP / SSL / HTTP as https, or TCP / SSL / SOCKS / HTTP as https via an ssl-enabled socks proxy, without too many problems. Another example, shaping a connection: TCP / RATELIMIT / HTTP.
Interesting idea. This may be up to the implementation -- not every implementation may have BIO wrappers available (AFAIK the stdlib doesn't),
Right, for SSL BIOs pyopenssl is required - or ctypes.
So maybe we can visualise this as T1 <--> P2:T2 <--> P3:T3 <--> P4.
Yes, exactly.
Having SSL as a Protocol allows closing the SSL connection without closing the TCP connection, re-using the TCP connection, re-using a SSL session cookie during reconnect of the SSL Protocol.
That seems a pretty esoteric use case (though given your background in honeypots maybe common for you :-). It also seems hard to get both sides acting correctly when you do this (but I'm certainly no SSL expert -- I just want it supported because half the web is inaccessible these days if you don't speak SSL, regardless of whether you do any actual verification).
Well, proper shutdown is not an SSL protocol requirement; closing the connection hard saves some cycles, so it pays off not to do it right in large-scale deployments - such as Google's. Nevertheless, doing SSL properly can help, as it allows distinguishing connection-reset errors from a proper shutdown.
The only concern I have, really, is that the PEP currently hints that both protocols and transports might have pause() and resume() methods for flow control, where the protocol calls transport.pause() if protocol.data_received() is called too frequently, and the transport calls protocol.pause() if transport.write() has buffered more data than sensible. But for an object that is both a protocol and a transport, this would make it impossible to distinguish between pause() calls by its left and right neighbors. So maybe the names must differ. Given the tendency of transport method names to be shorter (e.g. write()) vs. the longer protocol method names (data_received(), connection_lost() etc.), perhaps it should be transport.pause() and protocol.pause_writing() (and similar for resume()).
Protocol.data_received - rename to Protocol.io_in
Protocol.io_out - in case the transport's out buffer is empty (instead of Protocol.next_layer_is_empty())
Protocol.pause_io_out - in case the transport wants to stop the protocol sending more, as the out buffer is crowded already
Protocol.resume_io_out - in case the transport wants to inform the protocol that the out buffer can take some more bytes again

For the Protocol limiting the amount of data received:

Transport.pause -> Transport.pause_io_in
Transport.resume -> Transport.resume_io_in

or drop the "_io" from the names: "(pause|resume)_(in|out)"
* reconnect() - I'd love to be able to reconnect a transport
But what does that mean in general? It depends on the protocol (e.g. FTP, HTTP, IRC, SMTP) how much state must be restored/renegotiated upon a reconnect, and how much data may have to be re-sent. This seems a higher-level feature that transports and protocols will have to implement themselves.
I don't need the EventLoop to sync my state upon reconnect - just have the Transport provide the ability. Protocols are free to use it, but do not have to.
Now, in case we connect to a host by name, and have multiple addresses resolved, and the first connection can not be established, there is no way to 'reconnect()' - as the protocol does not yet exist.
Twisted suggested something here which I haven't implemented yet but which seems reasonable -- using a series of short timeouts try connecting to the various addresses and keep the first one that connects successfully. If multiple addresses connect after the first timeout, too bad, just close the redundant sockets, little harm is done (though the timeouts should be tuned that this is relatively rare, because a server may waste significant resources on such redundant connects).
Fast, yes - reasonable? - no. How would you feel if web browsers behaved like this? The domain name has to be resolved, the addresses ordered according to RFC X, which says prefer IPv6 etc., and then the connections tried linearly.
For almost all the timeouts I mentioned - the protocol needs to take care - so the protocol has to exist before the connection is established in case of outbound connections.
I'm not sure I follow. Can you sketch out some code to help me here? ISTM that e.g. the DNS, connect and handshake timeouts can be implemented by the machinery that tries to set up the connection behind the scenes, and the user's protocol won't know anything of these shenanigans. The code that calls create_transport() (actually it'll probably be renamed create_client()) will just get a Future that either indicates success (and then the protocol and transport are successfully hooked up) or an error (and then no protocol was created -- whether or not a transport was created is an implementation detail).
From my understanding the Future does not provide any information about which connection to which host using which protocol and credentials failed? I'd create the Protocol when trying to create a connection, so the Protocol is informed when the Transport fails and can take action - retry, whatever.
In case a connection is lost and reconnecting is required, .reconnect() is handy, so the protocol can request reconnecting.
I'd need more details of how you would like to specify this.
Transport:
* is closed by remote
* connecting to the remote failed
* resolving the domain name failed
All of these have to inform the protocol about the failure - and if the Protocol changes the Transport's state to "reconnect", the Transport creates a reconnect timer of N seconds and retries connecting then. It is up to the protocol to log in, clean its state and start fresh, or to log in and regain the old state by issuing the required commands to get there. For FTP this would be changing the cwd.
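A minimal sketch of that reconnect flow, assuming a call_later() on the loop; all class names here are invented for illustration, not PEP API:

```python
class ReconnectingTransport:
    """Sketch: a transport whose protocol may request a reconnect."""

    RECONNECT_DELAY = 5  # seconds; arbitrary for this sketch

    def __init__(self, loop, protocol):
        self.loop = loop
        self.protocol = protocol
        protocol.transport = self
        self.state = 'connected'

    def reconnect(self):
        # Called by the protocol (e.g. from connection_lost()) to ask
        # for a retry; the event loop itself stays out of the decision.
        self.state = 'reconnect'

    def _connection_failed(self, exc):
        # Closed by remote, connect failed, and DNS failure all end here.
        self.state = 'closed'
        self.protocol.connection_lost(exc)
        if self.state == 'reconnect':
            self.loop.call_later(self.RECONNECT_DELAY, self._connect)

    def _connect(self):
        # Real code would redo DNS resolution and connect here; the
        # protocol then logs in and restores its state (e.g. FTP cwd).
        self.state = 'connected'


class RetryingProtocol:
    """Toy protocol that always asks for a reconnect."""

    transport = None

    def connection_lost(self, exc):
        self.transport.reconnect()    # up to the protocol, not the loop
```

The protocol decides whether to reconnect; the transport only owns the timer and the actual connection attempt.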
As this does not work with the current Protocol callbacks, I therefore propose Protocol.connection_established().
How does this differ from connection_made()?
If you create the Protocol before the connection is established, you may want to distinguish between _made() and _established(). You cannot use __init__ for this, as it may lack the Transport argument.
(I'm trying to follow Twisted's guidance here, they seem to have the longest experience doing these kinds of things. When I talked to Glyph IIRC he was skeptical about reconnecting in general.)
Point is - connections don't last forever, even if we want them to. If the transport supports "reconnect", it is still up to the protocol to either support it or not. If a Protocol gets disconnected and wants to reconnect - without the Transport supporting .reconnect() - the protocol has to know its factory.
+ connection_established()
+ timeout_dns()
+ timeout_idle()
+ timeout_connecting()
Signatures please?
+ connection_established(self, transport) - the connection is established; in your proposal this is connection_made(), which I disagree with due to the lack of context in the Futures. Returns None.
+ timeout_dns(self) - resolving the domain name failed; the Protocol can .reconnect() for another try. Returns None.
+ timeout_idle(self) - the connection was idle for some time; send a higher-layer keepalive or close the connection. Returns None.
+ timeout_connecting(self) - the connection attempt timed out; the Protocol can .reconnect() for another try. Returns None.
* data_received(data) - if it was possible to return the number of bytes consumed by the protocol, and have the Transport buffer the rest for the next io-in call, one would avoid having to do this in every Protocol on its own - learned from experience.
Twisted has a whole slew of protocol implementation subclasses that implement various strategies like line-buffering (including a really complex version where you can turn the line buffering on and off) and "netstrings". I am trying to limit the PEP's size by not including these, but I fully expect that in practice a set of useful protocol implementations will be created that handle common cases. I'm not convinced that putting this in the transport/protocol interface will make user code less buggy: it seems easy for the user code to miscount the bytes or not return a count at all in a rarely taken code branch.
Please don't drop this. You never know how much data you'll receive and you never know how much data you need for a message, so the Protocol needs a buffer. Having this incoming buffer in the Transport allows every Protocol to benefit: they try to read a message from the data passed to data_received(), and if the data received is not sufficient to create a full message, they need to buffer it and wait for more data. So if Protocol.data_received() returns the number of bytes the Protocol could process, the Transport can do the job, saving it for every Protocol. Still, a protocol can have its own buffering strategy - e.g. an incremental XML parser which does its own buffering - and always return len(data), so the Transport does not buffer anything.

If the size returned by the Protocol is less than the size of the buffer given to the protocol, the Transport erases only the consumed bytes from the buffer; if the length matches the size of the buffer passed, it erases the whole buffer. In nonblocking IO this buffering has to be done for every protocol; if the Transport takes care of it, the Protocol's data_received() method does not need to bother - a benefit for every protocol. Otherwise, every Protocol.data_received() method starts with self.buffer += data and ends with self.buffer = self.buffer[len(consumed):]. You can even default a return value of None to mean len(data).

If you want to be fancy, you could even keep passing the data to the Protocol as long as the protocol consumes data and there is data left. This way a protocol's data_received() can focus on processing a single message: if more than a single message is contained in the data, it will get the data again (as it returned > 0); if there is no message left in the data, it will return 0. This really assists when writing protocols, and as every protocol needs it, have it in the Transport.
* eof_received()/connection_lost(exc) - a connection can be closed clean recv()=0, unclean recv()=-1, errno, SIGPIPE when writing and in case of SSL even more, it is required to distinguish.
Well, this is why eof_received() exists -- to indicate a clean close. We should never receive SIGPIPE (Python disables this signal, so you always get the errno instead). According to Glyph, SSL doesn't support sending eof, so you have to use Content-length or a chunked encoding. What other conditions do you expect from SSL that wouldn't be distinguished by the exception instance passed to connection_lost()?
Depends on the implementation of SSL (BIO or fd, Transport or Protocol): SSL_ERROR_SYSCALL, and - unlikely - SSL_ERROR_SSL. In case of stacking TCP / SSL / HTTP, an SSL service rejecting a client certificate for login is - to me - a connection_lost too.
+ nextlayer_is_empty() - called when the write buffer of the Transport (or of the underlying Protocol, in case of chaining) is empty
That's what the pause()/resume() flow control protocol is for. You read the file (presumably it's a file) in e.g. 16K blocks and call write() for each block; if the transport can't keep up and exceeds its buffer space, it calls protocol.pause() (or perhaps protocol.pause_writing(), see discussion above).
I'd still love a callback for "we are empty". Protocol.io_out - maybe the name changes your mind?
Next, what happens if a DNS name cannot be resolved, the SSL handshake fails (in case SSL is a transport), or connecting fails? In my opinion it's an error the protocol is supposed to take care of:
+ error_dns
+ error_ssl
+ error_connecting
The future returned by create_transport() (aka create_client()) will raise the exception.
When do I get this exception - the EventLoop.run() raises? And does this exception have all the information required to retry connecting? Let's say I want to reconnect after 20s in case of a DNS error: the Future raises, and - depending on the exception - I call_later() a callback which calls create_transport() again? Compared to Transport.reconnect() from the Protocol, that is not really easier. MfG Markus
On Sun, 6 Jan 2013 16:45:52 +0100
Markus
Transports: I think SSL should be a Protocol, not a transport - implemented using BIO pairs. If you can chain protocols, like Transport / ProtocolA / ProtocolB, you can have TCP / SSL / HTTP as https, or TCP / SSL / SOCKS / HTTP as https via an SSL-enabled SOCKS proxy, without too many problems. Another example: shaping a connection, TCP / RATELIMIT / HTTP.
Interesting idea. This may be up to the implementation -- not every implementation may have BIO wrappers available (AFAIK the stdlib doesn't),
Right, for SSL BIOs pyopenssl is required - or ctypes.
Or a patch to Python 3.4. See http://docs.python.org/devguide/ By the way, how does "SSL as a protocol" deal with SNI? How does the HTTP layer tell the SSL layer which servername to indicate? Or, on the server-side, how would the SSL layer invoke the HTTP layer's servername callback?
(I'm trying to follow Twisted's guidance here, they seem to have the longest experience doing these kinds of things. When I talked to Glyph IIRC he was skeptical about reconnecting in general.)
Point is - connections don't last forever, even if we want them to. If the transport supports "reconnect", it is still up to the protocol to either support it or not. If a Protocol gets disconnected and wants to reconnect - without the Transport supporting .reconnect() - the protocol has to know its factory.
+1 to this.
+ connection_established(self, transport) - the connection is established; in your proposal this is connection_made(), which I disagree with due to the lack of context in the Futures. Returns None.
+ timeout_dns(self) - resolving the domain name failed; the Protocol can .reconnect() for another try. Returns None.
+ timeout_idle(self) - the connection was idle for some time; send a higher-layer keepalive or close the connection. Returns None.
+ timeout_connecting(self) - the connection attempt timed out; the Protocol can .reconnect() for another try. Returns None.
I would rather have connection_failed(self, exc) (where exc can be an OSError or a socket.timeout).
* data_received(data) - if it was possible to return the number of bytes consumed by the protocol, and have the Transport buffer the rest for the next io in call, one would avoid having to do this in every Protocol on it's own - learned from experience.
Twisted has a whole slew of protocol implementation subclasses that implement various strategies like line-buffering (including a really complex version where you can turn the line buffering on and off) and "netstrings". I am trying to limit the PEP's size by not including these, but I fully expect that in practice a set of useful protocol implementations will be created that handles common cases. I'm not convinced that putting this in the transport/protocol interface will make user code less buggy: it seems easy for the user code to miscount the bytes or not return a count at all in a rarely taken code branch.
Please don't drop this.
You never know how much data you'll receive and you never know how much data you need for a message, so the Protocol needs a buffer. Having this incoming buffer in the Transport allows every Protocol to benefit: they try to read a message from the data passed to data_received(), and if the data received is not sufficient to create a full message, they need to buffer it and wait for more data.
Another solution for every Protocol to benefit is to provide a bunch of base Protocol implementations, as Twisted does: LineReceiver, etc. Your proposed solution (returning the number of consumed bytes) implies a lot of slicing and concatenation of immutable bytes objects inside the Transport, which may be quite inefficient. Regards Antoine.
Hi,
On Sun, Jan 6, 2013 at 5:25 PM, Antoine Pitrou
On Sun, 6 Jan 2013 16:45:52 +0100 Markus
wrote: Right, for ssl bios pyopenssl is required - or ctypes.
Or a patch to Python 3.4. See http://docs.python.org/devguide/
Or discuss merging pyopenssl.
By the way, how does "SSL as a protocol" deal with SNI? How does the HTTP layer tell the SSL layer which servername to indicate? SSL_set_tlsext_host_name
Or, on the server-side, how would the SSL layer invoke the HTTP layer's servername callback?
callback - set via SSL_CTX_set_tlsext_servername_callback SSL_CTX_set_tlsext_servername_arg
I would rather have connection_failed(self, exc). (where exc can be a OSError or a socket.timeout)
I'd prefer a single callback per error, which allows preserving the defaults for certain cases when inheriting from Protocol.
You never know how much data you'll receive and you never know how much data you need for a message, so the Protocol needs a buffer. Having this incoming buffer in the Transport allows every Protocol to benefit: they try to read a message from the data passed to data_received(), and if the data received is not sufficient to create a full message, they need to buffer it and wait for more data.
Another solution for every Protocol to benefit is to provide a bunch of base Protocol implementations, as Twisted does: LineReceiver, etc.
In case your Protocol.data_received gets called until there is nothing left or 0 is returned, the LineReceiver simply looks for a \0 or \n in the data, processes this line, and returns the length of the line - or 0 in case there is no line terminator.
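Under the consumed-bytes proposal, that LineReceiver could be sketched like this (illustrative only; the transport is assumed to re-call data_received() until it returns 0):

```python
class LineReceiver:
    """Sketch: emits one line per data_received() call, returns bytes consumed."""

    def __init__(self):
        self.lines = []

    def data_received(self, data):
        pos = data.find(b'\n')
        if pos == -1:
            return 0              # no terminator yet; transport keeps buffering
        self.lines.append(data[:pos])
        return pos + 1            # the line plus its '\n'
```

No buffer management appears in the protocol at all; the transport's re-call loop delivers the second line of a two-line chunk on the next iteration.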
Your proposed solution (returning the number of consumed bytes) implies a lot of slicing and concatenation of immutable bytes objects inside the Transport, which may be quite inefficient.
Yes - but it has to be done anyway, so it's just a matter of having this problem in the stdlib, where it is easy to improve for everybody, or having everybody else come up with their own implementation as part of the Protocol. I'd therefore prefer to have this in the Transport - having everybody benefit from any improvement for free. Markus
On Sun, 6 Jan 2013 21:46:04 +0100
Markus
By the way, how does "SSL as a protocol" deal with SNI? How does the HTTP layer tell the SSL layer which servername to indicate? SSL_set_tlsext_host_name
Or, on the server-side, how would the SSL layer invoke the HTTP layer's servername callback?
callback - set via SSL_CTX_set_tlsext_servername_callback SSL_CTX_set_tlsext_servername_arg
Right, these are the C OpenSSL APIs. My question was about the Python protocol / transport level. How can they be exposed?
Your proposed solution (returning the number of consumed bytes) implies a lot of slicing and concatenation of immutable bytes objects inside the Transport, which may be quite inefficient.
Yes - but it has to be done anyway, so it's just a matter of having this problem in the stdlib, where it is easy to improve for everybody, or having everybody else come up with their own implementation as part of the Protocol.
Actually, the point is that it doesn't have to be done. An internal buffering mechanism in a protocol can avoid making many copies and concatenations (e.g. by using a list or a deque to buffer the incoming chunks). The transport cannot, since the Protocol API mandates that data_received() be called with a bytes object representing the available data. Regards Antoine.
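Antoine's point can be sketched as a protocol-internal buffer that collects chunks in a list and joins them only when the protocol actually consumes a message (all names here are illustrative):

```python
class ChunkBufferingProtocol:
    """Sketch: buffer incoming chunks in a list, join only on consumption."""

    def __init__(self):
        self._chunks = []
        self._size = 0

    def data_received(self, data):
        self._chunks.append(data)     # no copying or slicing on arrival
        self._size += len(data)

    def read_exactly(self, n):
        """Return n buffered bytes, or None if not enough has arrived yet."""
        if self._size < n:
            return None
        buf = b''.join(self._chunks)  # one join at consumption time
        self._chunks = [buf[n:]] if len(buf) > n else []
        self._size = len(buf) - n
        return buf[:n]
```

The transport, by contrast, must hand data_received() a single bytes object, so it cannot defer the joins this way.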
Hi,
On Sun, Jan 6, 2013 at 10:05 PM, Antoine Pitrou
On Sun, 6 Jan 2013 21:46:04 +0100 Markus
wrote: By the way, how does "SSL as a protocol" deal with SNI? How does the HTTP layer tell the SSL layer which servername to indicate?
Transport.ctrl(name, **kwargs) - if the Transport lacks the queried control, it has to ask its upper. In case of chains like TCP / SSL / HTTP, SSL can query the hostname from its Transport - or HTTP can query it.
Or, on the server-side, how would the SSL layer invoke the HTTP layer's servername callback?
Transport.ctrl(name, **kwargs) HTTP can query for the name, in case of TCP / SSL / HTTP, SSL may provide an answer.
Right, these are the C OpenSSL APIs. My question was about the Python protocol / transport level. How can they be exposed?
Attributes of the Transport (or the Transport-side of a Protocol, in case of stacking), which can be queried. For TCP e.g. it would be handy to store connection-related things in a defined data structure which keeps the domain, resolved addresses, and the address used for the current connection together, like TCP.{local,remote}.{address,addresses,domain,port}. For a client, SSL can query for "TCP.remote.domain" and - in case it is not an IP address - use it for SNI. For a server, HTTP can query SSL.server_name_indication.
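Such a ctrl() delegation chain could be sketched like this; the Layer class, its constructor, and the dotted key strings are all assumptions for illustration, not anything from the PEP:

```python
class Layer:
    """Sketch: one element of a Transport/Protocol chain supporting ctrl()."""

    def __init__(self, lower=None, info=None):
        self.lower = lower            # the transport side below this layer
        self.info = info or {}        # controls this layer can answer itself

    def ctrl(self, name):
        if name in self.info:
            return self.info[name]
        if self.lower is not None:    # not ours: ask the layer below
            return self.lower.ctrl(name)
        return None
```

A TCP / SSL / HTTP stack then lets HTTP ask for "SSL.server_name_indication" and SSL ask for "TCP.remote.domain" without either knowing the chain's exact shape.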
An internal buffering mechanism in a protocol can avoid making many copies and concatenations (e.g. by using a list or a deque to buffer the incoming chunks). The transport cannot, since the Protocol API mandates that data_received() be called with a bytes object representing the available data.
bytes-like would be much better than bytes for the definition of data_received: same semantics, but a list of memoryviews with an offset, or whatever is required internally. MfG Markus
(Trimming stuff that doesn't need a reply -- this doesn't mean I
agree, just that I don't see a need for more discussion.)
On Sun, Jan 6, 2013 at 7:45 AM, Markus
Exactly - signals are a mess, and threading and signals make things worse. I'm no expert here, but I have experienced problems with signal handling and threads, basically the same problems you describe. Creating the threads after installing the signal handlers (in the main thread) works, and signals get delivered to the main thread; installing the signal handlers (in the main thread) after creating the threads, and the signals ended up in *some* thread. Additionally, it depended on whether you installed your signal handler with signal() or sigaction(), and on the flags used when creating threads.
So I suppose you're okay with the signal handling API I proposed? I'll add it to the PEP then, with a note that it may raise an exception if not supported.
I'd expect the EventLoop never to create threads on its own behalf; it's just wrong.
Here's the way it works. You can call run_in_executor(executor, function, *args) where executor is an executor (a fancy thread pool) that you create. You have full control. However you can pass executor=None and then the event loop will create its own, default executor -- or it will use a default executor that you have created and given to it previously. It needs the default executor so that it can implement getaddrinfo() by calling the stdlib socket.getaddrinfo() in a thread -- and getaddrinfo() is essential for creating transports. The user can take full control over the executor though -- you could set the default to something that always raises an exception.
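The default-executor plumbing Guido describes can be sketched in a few lines; MiniLoop is an invented stand-in for the event loop, and pow() stands in for the blocking socket.getaddrinfo() call to keep the example self-contained:

```python
from concurrent.futures import ThreadPoolExecutor


class MiniLoop:
    """Sketch of only the executor plumbing -- not a real event loop."""

    def __init__(self):
        self._default_executor = None

    def set_default_executor(self, executor):
        # The user can take full control by installing their own default.
        self._default_executor = executor

    def run_in_executor(self, executor, function, *args):
        if executor is None:          # fall back to a lazily created default
            if self._default_executor is None:
                self._default_executor = ThreadPoolExecutor(max_workers=5)
            executor = self._default_executor
        return executor.submit(function, *args)


loop = MiniLoop()
# The loop would run the blocking socket.getaddrinfo() exactly this way;
# pow() stands in so the example needs no network:
fut = loop.run_in_executor(None, pow, 2, 10)
```

Passing an executor that always raises would effectively forbid the loop from using threads, which addresses the "no threads behind my back" concern.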
If you can't provide some functionality without threads, don't provide the functionality.
I don't see this as an absolute requirement. The threads are an implementation detail (other event loop implementations could implement getaddrinfo() differently, taking directly to DNS using tasklets or callbacks), and you can control its use of threads.
Besides, getaddrinfo() is a bad choice, as it relies on distribution-specific flags. For example, the IPv6 link-local scope exists on every current platform, but when resolving a link-local scope address (not a domain) with getaddrinfo(), it will fail on Debian/Ubuntu if no globally routed IPv6 address is available.
Nevertheless it is the only thing available in the stdlib. If you want to improve it, that's fine, but just use the issue tracker.
As Transports are part of the PEP - some more:
EventLoop:
* create_transport(protocol_factory, host, port, **kwargs) - kwargs should accept "local", the local address as a tuple like ('fe80::14ad:1680:54e1:6a91%eth0', 0) - so you can bind when using the IPv6 link-local scope - or ('192.168.2.1', 5060) to bind the local port for UDP.
Not sure I understand. What socket.connect() (or other API) call parameters does this correspond to? What can't expressed through the host and port parameters?
In case you have multiple interfaces and multiple gateways, you need to assign the connection to an address, so the kernel knows which interface to use for the connection - else it defaults to "the first" interface. In the IPv6 link-local scope you can have multiple addresses in the same subnet fe80::; IIRC, if you want to connect somewhere, you have to either set the scope_id of the remote, or bind the "source" address first. I don't know how to set the scope_id in Python; it's in sockaddr_in6.
In terms of socket calls, it is a bind before a connect:
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0)
s.bind(('fe80::1', 0))
s.connect(('fe80::2', 4712))
Same for IPv4, in case you are multi-homed and rely on source-based routing.
Ok, this seems a useful option to add to create_transport(). Your example shows SOCK_DGRAM -- is it also relevant for SOCK_STREAM?
Handler: Requiring 2 handlers for every active connection r/w is highly ineffective.
How so? What is the concern?
Of course you can fold the fdsets, but in case you need a separate handler for write, you re-create it for every write - see below.
That would seem to depend on the write rate.
Additionally, I can .stop() the handler without having to know the fd, change the events the handler is looking for, and restart the handler with .start(). In your proposal, I'd create a new handler every time I want to send something, poll for writability, discard the handler when I'm done, and create a new one for the next send.
The questions are, does it make any difference in efficiency (when using Python -- the performance of the C API is hardly relevant here), and how often does this pattern occur.
Every time you send, you poll for writability; you get the callback, you write, you have nothing left, you stop polling for writability.
That's not quite how it's implemented. The code first tries to send without polling. Since the socket is non-blocking, if this succeeds, great -- only if it returns a partial send or EAGAIN we register a callback. If the protocol keeps the buffer filled the callback doesn't have to be recreated each time. If the protocol doesn't keep the buffer full, we must unregister the callback to prevent select/poll/etc. from calling it over and over again, there's nothing you can do about that.
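That write path - try the non-blocking send first, and only register a writability callback on a partial send - can be sketched like so; WriteBufferedTransport and the add_writer()/remove_writer() loop methods are assumed names for illustration:

```python
class WriteBufferedTransport:
    """Sketch: buffer writes only when the kernel can't take them."""

    def __init__(self, sock, loop):
        self.sock = sock              # a non-blocking socket
        self.loop = loop
        self.buffer = b''

    def write(self, data):
        if not self.buffer:           # nothing pending: try sending right away
            try:
                n = self.sock.send(data)
            except BlockingIOError:   # EAGAIN: kernel buffer is full
                n = 0
            data = data[n:]
        if data:                      # leftover bytes: poll for writability
            if not self.buffer:
                self.loop.add_writer(self.sock.fileno(), self._on_writable)
            self.buffer += data

    def _on_writable(self):
        n = self.sock.send(self.buffer)
        self.buffer = self.buffer[n:]
        if not self.buffer:           # drained: unregister, or select/poll
            self.loop.remove_writer(self.sock.fileno())  # fires forever
```

If the protocol keeps the buffer non-empty, the writer callback stays registered across writes; it is only torn down when the buffer drains.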
* reconnect() - I'd love to be able to reconnect a transport
But what does that mean in general? It depends on the protocol (e.g. FTP, HTTP, IRC, SMTP) how much state must be restored/renegotiated upon a reconnect, and how much data may have to be re-sent. This seems a higher-level feature that transports and protocols will have to implement themselves.
I don't need the EventLoop to sync my state upon reconnect - just have the Transport providing the ability. Protocols are free to use this, but do not have to.
Aha, I get it. You want to be able to call transport.reconnect() from connection_lost() and it should respond by eventually calling protocol.connection_made(transport) again. Of course, this only applies to clients -- for a server to reconnect to a client makes no sense (it would be up to the client). That seems simple enough to implement, but Glyph recommended strongly against this, because reusing the protocol object often means that some private state of the protocol may not be properly reinitialized. It would also be difficult to decide where errors from the reconnect attempt should go -- reconnect() itself must return immediately (since connection_lost() cannot wait for I/O, it can only schedule async I/O events). But at a higher level in your app it would be easy to set this up: you just call eventloop.create_transport(lambda: protocol, ...) where protocol is a protocol instance you've created earlier.
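The higher-level setup Guido suggests - reusing an existing protocol instance via a factory that just returns it - might look like this sketch (create_transport()'s exact signature, and the host/port, are assumptions):

```python
class MyProtocol:
    """Sketch: a protocol that re-registers itself with the loop on loss."""

    def __init__(self, eventloop):
        self.eventloop = eventloop
        self.transport = None

    def connection_made(self, transport):
        self.transport = transport

    def connection_lost(self, exc):
        # Reuse *this* instance: the factory is just `lambda: self`.
        # Note Glyph's caveat that any leftover private state must be
        # reinitialized explicitly before reconnecting.
        self.eventloop.create_transport(lambda: self, 'example.com', 1234)
```

This keeps reconnection out of the transport entirely; errors from the new attempt surface through the Future that create_transport() returns.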
Twisted suggested something here which I haven't implemented yet but which seems reasonable -- using a series of short timeouts try connecting to the various addresses and keep the first one that connects successfully. If multiple addresses connect after the first timeout, too bad, just close the redundant sockets, little harm is done (though the timeouts should be tuned that this is relatively rare, because a server may waste significant resources on such redundant connects).
Fast, yes - reasonable? - no. How would you feel if web browsers behaved like this?
I have no idea -- who says they aren't doing this? Browsers do tons of stuff that I am not aware of.
domain name has to be resolved, addresses ordered according to rfc X which says prefer ipv6 etc., try connecting linear.
Sure. It was just an idea. I'll see what Twisted actually does.
For almost all the timeouts I mentioned - the protocol needs to take care - so the protocol has to exist before the connection is established in case of outbound connections.
I'm not sure I follow. Can you sketch out some code to help me here? ISTM that e.g. the DNS, connect and handshake timeouts can be implemented by the machinery that tries to set up the connection behind the scenes, and the user's protocol won't know anything of these shenanigans. The code that calls create_transport() (actually it'll probably be renamed create_client()) will just get a Future that either indicates success (and then the protocol and transport are successfully hooked up) or an error (and then no protocol was created -- whether or not a transport was created is an implementation detail).
From my understanding the Future does not provide any information which connection to which host using which protocol and credentials failed?
That's not up to the Future -- it just passes an exception object along. We could make this info available as attributes on the exception object, if there is a need.
I'd create the Protocol when trying to create a connection, so the Protocol is informed when the Transport fails and can take action - retry, whatever.
I had this in an earlier version, but Glyph convinced me that this is the wrong design -- and it doesn't work for servers anyway, you must have a protocol factory there.
* data_received(data) - if it was possible to return the number of bytes consumed by the protocol, and have the Transport buffer the rest for the next io in call, one would avoid having to do this in every Protocol on it's own - learned from experience.
Twisted has a whole slew of protocol implementation subclasses that implement various strategies like line-buffering (including a really complex version where you can turn the line buffering on and off) and "netstrings". I am trying to limit the PEP's size by not including these, but I fully expect that in practice a set of useful protocol implementations will be created that handles common cases. I'm not convinced that putting this in the transport/protocol interface will make user code less buggy: it seems easy for the user code to miscount the bytes or not return a count at all in a rarely taken code branch.
Please don't drop this.
You never know how much data you'll receive, you never know how much data you need for a message, so the Protocol needs a buffer.
That all depends on what the protocol is trying to do. (The ECHO protocol certainly doesn't need a buffer. :-)
Having this incoming buffer in the Transport allows every Protocol to benefit: they try to read a message from the data passed to data_received(), and if the data received is not sufficient to create a full message, they need to buffer it and wait for more data.
Having it in a Protocol base class also allows every protocol that wants it to benefit, without complicating the transport. I can also see problems where the transport needs to keep calling data_received() until either all data is consumed or it returns 0 (no data consumed). It just doesn't seem right to make the transport responsible for this logic, since it doesn't know enough about the needs of the protocol.
So if Protocol.data_received returns the number of bytes the Protocol could process, the Transport can do the job, saving it for every Protocol. Still, a protocol can have its own buffering strategy - e.g. an incremental XML parser which does its own buffering - and always return len(data), so the Transport does not buffer anything.
Right, data_received() is closely related to the concept of a "feed parser" which is used in a few places in the stdlib (http://docs.python.org/3/search.html?q=feed&check_keywords=yes&area=default) and even has a 3rd party implementation (http://pypi.python.org/pypi/feedparser/), and there the parser (i.e. the protocol equivalent) is always responsible for buffering data it cannot immediately process.
Next, what happens if a DNS name cannot be resolved, the SSL handshake fails (in case SSL is a transport), or connecting fails? In my opinion it's an error the protocol is supposed to take care of:
+ error_dns
+ error_ssl
+ error_connecting
The future returned by create_transport() (aka create_client()) will raise the exception.
When do I get this exception - the EventLoop.run() raises?
No, the eventloop doesn't normally raise, just whichever task is waiting for that future using 'yield from' will get the exception. Or you can use eventloop.run_until_complete(<future>) and then that call will raise. -- --Guido van Rossum (python.org/~guido)
participants (4):
- Antoine Pitrou
- Guido van Rossum
- Markus
- Richard Oudkerk