[Async-sig] APIs for high-bandwidth large I/O?
yselivanov at gmail.com
Thu Dec 7 21:03:46 EST 2017
Thanks for posting this, and sorry for the delayed reply!
I've known for a while that asyncio Protocols could be optimized. While working on the initial version of uvloop, I noticed that `Protocol.data_received()` requires making one extra copy of the received data. Back then my main priority was to make uvloop fully compatible with asyncio, so I wasn't really thinking about improving asyncio's design.
Let me explain the current flaw of `Protocol.data_received()` so that other people on the list can catch up with the discussion:
1. Currently, when a Transport is reading data, it calls `sock.recv()`, which returns a `bytes` object that is then passed to `Protocol.data_received()`. Every call to `sock.recv()` allocates a new bytes object.
2. Typically, protocols need to accumulate the bytes objects they receive until they have enough buffered data to parse. Usually a `deque` is used for that; less optimized code just concatenates all bytes objects into one.
3. When enough data is gathered and a protocol message can be parsed out of it, usually there's a need to concatenate a few buffers from the `deque` or get a slice of the concatenated buffer. At this point, we've copied the received data two times.
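The three steps above can be sketched roughly as follows. This is only an illustration, not code from asyncio or uvloop: the class name `FramedProtocol`, the 4-byte big-endian length prefix, and the `messages` list are assumptions made for the example.

```python
import asyncio
import struct
from collections import deque


class FramedProtocol(asyncio.Protocol):
    """Hypothetical length-prefixed protocol built on data_received()."""

    HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length

    def __init__(self):
        self._chunks = deque()   # step 2: accumulate bytes objects
        self._buffered = 0       # total bytes currently held in the deque
        self.messages = []       # parsed payloads, for illustration only

    def data_received(self, data):
        # Step 1: `data` is a fresh bytes object allocated by sock.recv().
        self._chunks.append(data)
        self._buffered += len(data)
        self._parse()

    def _parse(self):
        while self._buffered >= self.HEADER.size:
            # Step 3: joining the chunks copies the received data again.
            joined = b"".join(self._chunks)
            (length,) = self.HEADER.unpack_from(joined)
            end = self.HEADER.size + length
            if len(joined) < end:
                # Not enough data yet; keep the joined buffer as one chunk.
                self._chunks = deque([joined])
                return
            # Slicing out the payload is a further copy.
            self.messages.append(joined[self.HEADER.size:end])
            rest = joined[end:]
            self._chunks = deque([rest] if rest else [])
            self._buffered = len(rest)
```

Feeding this protocol a message split across two `data_received()` calls shows the problem: the payload is allocated by `recv()`, then copied by the join and the slice before the application ever sees it.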
I propose to add another Protocol base class to asyncio: BufferedProtocol. It won't have the `data_received()` method; instead it will have `get_buffer()` and `buffer_updated(nbytes)` methods:
    def get_buffer(self) -> memoryview: ...
    def buffer_updated(self, nbytes: int): ...
When the protocol's transport is ready to receive data, it will call `protocol.get_buffer()`. The latter must return an object that implements the buffer protocol. The transport will request a writable buffer over the returned object and receive data *into* that buffer.
When the `sock.recv_into(buffer)` call is done, the `protocol.buffer_updated(nbytes)` method will be called. The number of bytes received into the buffer will be passed as the first argument.
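To make the calling convention concrete, here is a rough sketch of a protocol written against the two proposed methods, together with a stand-in for the transport's read step. The class `FixedFrameProtocol`, the fixed 8-byte frame size, and the helper `transport_read_once()` are hypothetical and exist only for this illustration; the sketch deliberately uses plain sockets rather than any asyncio or uvloop API.

```python
import socket


class FixedFrameProtocol:
    """Hypothetical protocol using the proposed get_buffer()/buffer_updated()."""

    FRAME_SIZE = 8  # assumed fixed frame size, for illustration only

    def __init__(self):
        self._frame = bytearray(self.FRAME_SIZE)  # allocated once, reused
        self._view = memoryview(self._frame)
        self._filled = 0
        self.frames = []

    def get_buffer(self) -> memoryview:
        # Hand out only the unfilled tail, so data lands where it belongs.
        return self._view[self._filled:]

    def buffer_updated(self, nbytes: int):
        self._filled += nbytes
        if self._filled == self.FRAME_SIZE:
            # One copy, made when the complete frame is handed to the application.
            self.frames.append(bytes(self._frame))
            self._filled = 0


def transport_read_once(sock, protocol):
    """Stand-in for what a transport would do when the socket is readable."""
    buf = protocol.get_buffer()
    nbytes = sock.recv_into(buf)  # receive directly into the protocol's buffer
    if nbytes:
        protocol.buffer_updated(nbytes)
    return nbytes
```

The point of the design is that the protocol, not the transport, decides where incoming bytes are stored, so partial reads accumulate in place without intermediate bytes objects or a join.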
I've implemented the proposed design in uvloop (branch 'get_buffer') and adjusted your benchmark to use it. Here are benchmark results from my machine (macOS):
vanilla asyncio: 120-135 Mb/s
uvloop: 320-330 Mb/s
uvloop/get_buffer: 600-650 Mb/s.
The benchmark is quite unstable, but it's clear that Protocol.get_buffer() makes it possible to implement framing far more efficiently.
I'm also working on porting the asyncpg library to use get_buffer(), as it has a fairly good benchmark suite. So far I'm seeing a 5-15% speed boost on all benchmarks. More importantly, get_buffer() makes asyncpg's buffer implementation simpler!
I'm quite happy with these results and I propose to implement the get_buffer() API (or its equivalent) in Python 3.7. I've opened an issue to discuss the implementation details.
On Oct 18, 2017, 2:31 PM -0400, Antoine Pitrou <solipsis at pitrou.net>, wrote:
> I am currently looking into ways to optimize large data transfers for a
> distributed computing framework
> (https://github.com/dask/distributed/). We are using Tornado but the
> question is more general, as it turns out that certain kinds of API are
> an impediment to such optimizations.
> To put things short, there are a couple benchmarks discussed here:
> - for Tornado, this benchmark:
> - for asyncio, this benchmark:
> Both implement a trivial form of framing using the "preferred" APIs of
> each framework (IOStream for Tornado, Protocol for asyncio), and then
> benchmark it over 100 MB frames using a simple echo client/server.
> The results (on Python 3.6) are interesting:
> - vanilla asyncio achieves 350 MB/s
> - vanilla Tornado achieves 400 MB/s
> - asyncio + uvloop achieves 600 MB/s
> - an optimized Tornado IOStream with a more sophisticated buffering
> logic (https://github.com/tornadoweb/tornado/pull/2166)
> achieves 700 MB/s
> The latter result is especially interesting. uvloop uses hand-crafted
> Cython code + the C libuv library, still, a pure Python version of
> Tornado does better thanks to an improved buffering logic in the
> streaming layer.
> Even the Tornado result is not ideal. When profiling, we see that
> 50% of the runtime is actual IO calls (socket.send and socket.recv),
> but the rest is still overhead. Especially, buffering on the read side
> still has costly memory copies (b''.join calls take 22% of the time!).
> For a framed layer, you shouldn't need so many copies. Once you've
> read the frame length, you can allocate the frame upfront and read into
> it. It is at odds, however, with the API exposed by asyncio's Protocol:
> data_received() gives you a new bytes object as soon as data arrives.
> It's already too late: a spurious memory copy will have to occur.
> Tornado's IOStream is less constrained, but it supports too many read
> schemes (including several types of callbacks). So I crafted a limited
> version of IOStream (*) that supports little functionality, but is able
> to use socket.recv_into() when asked for a given number of bytes. When
> benchmarked, this version achieves 950 MB/s. This is still without C
> (*) see
> When profiling that limited version of IOStream, we see that 68% of the
> runtime is actual IO calls (socket.send and socket.recv_into).
> Still, 21% of the total runtime is spent allocating a 100 MB buffer for
> each frame! That's 70% of the non-IO overhead! Whether or not there
> are smart ways to reuse that writable buffer depends on how the
> application intends to use data: does it throw it away before the next
> read or not? It doesn't sound easily doable in the general case.
> So I'm wondering which kind of APIs async libraries could expose to
> make those use cases faster. I know curio and trio have socket objects
> which would probably fit the bill. I don't know if there are
> higher-level concepts that may be as adequate for achieving the highest
> Also, since asyncio is the de facto standard now, I wonder if asyncio
> might grow such a new API. That may be troublesome: asyncio already
> has Protocols and Streams, and people often complain about its
> extensive API surface that's difficult for beginners :-)
> Addendum: asyncio streams
> I didn't think asyncio streams would be a good solution, but I still
> wrote a benchmark variant for them out of curiosity, and it turns out I
> was right. The results:
> - vanilla asyncio streams achieve 300 MB/s
> - asyncio + uvloop streams achieve 550 MB/s
> The benchmark script is at