Hi Antoine,

Thanks for posting this, and sorry for the delayed reply!

I've known about this opportunity to optimize asyncio Protocols for a while.  I noticed that `Protocol.data_received()` requires making one extra copy of the received data back when I was working on the initial version of uvloop.  At the time my main priority was to make uvloop fully compatible with asyncio, so I wasn't really thinking about improving asyncio's design.


Let me explain the flaw in the current `Protocol.data_received()` design so that other people on the list can catch up with the discussion:

1. Currently, when a Transport is reading data, it uses a `sock.recv()` call, which returns a `bytes` object, which is then pushed to `Protocol.data_received()`.  Every time `sock.recv()` is called, a new bytes object is allocated.

2. Typically, protocols need to accumulate the bytes objects they receive until they have enough buffered data to be parsed.  Usually a `deque` is used for that; less optimized code just concatenates all the bytes objects into one.

3. When enough data has been gathered and a protocol message can be parsed out of it, a few buffers from the `deque` usually have to be concatenated, or a slice taken of the concatenated buffer.  At this point, the received data has been copied twice (see the sketch below).
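
To make the two copies concrete, here is a rough sketch of the kind of `data_received()`-based framing code described above (the class name and the 4-byte little-endian length prefix are purely illustrative, not taken from any particular project):

    import asyncio
    from collections import deque


    class FramedProtocol(asyncio.Protocol):
        """Length-prefixed framing on top of data_received()."""

        def __init__(self):
            self._chunks = deque()   # bytes objects produced by sock.recv()
            self._size = 0

        def data_received(self, data):
            # `data` is a freshly allocated bytes object -- the first copy
            # (kernel buffer -> new bytes object created by sock.recv()).
            self._chunks.append(data)
            self._size += len(data)
            self._maybe_parse()

        def _maybe_parse(self):
            while self._size >= 4:
                # To get a contiguous region, the chunks have to be joined
                # -- the second copy of the same data.
                data = b''.join(self._chunks)
                msg_len = int.from_bytes(data[:4], 'little')
                if self._size - 4 < msg_len:
                    break
                self.message_received(data[4:4 + msg_len])  # another slice/copy
                rest = data[4 + msg_len:]
                self._chunks = deque([rest]) if rest else deque()
                self._size = len(rest)

        def message_received(self, message):
            pass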


I propose to add another Protocol base class to asyncio: BufferedProtocol.  It won't have the `data_received()` method; instead it will have `get_buffer()` and `buffer_updated(nbytes)` methods:

    class asyncio.BufferedProtocol:

        def get_buffer(self) -> memoryview:
            """Return a writable buffer for the transport to receive data into."""

        def buffer_updated(self, nbytes: int):
            """Called when `nbytes` bytes were received into the last returned buffer."""

When the protocol's transport is ready to receive data, it will call `protocol.get_buffer()`.  The latter must return an object that implements the buffer protocol.  The transport will request a writable buffer over the returned object and receive data *into* that buffer.

When the `sock.recv_into(buffer)` call completes, the `protocol.buffer_updated(nbytes)` method will be called, with the number of bytes received into the buffer passed as its argument.
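
To show how this changes framing, here is a rough sketch of a length-prefixed protocol written against the proposed API (the class name, buffer size, and framing scheme are mine and only illustrative; the proposal itself is just the two methods above):

    import asyncio


    class FramedBufferedProtocol(asyncio.BufferedProtocol):
        """Length-prefixed framing on top of the proposed get_buffer() API."""

        MAX_FRAME = 1024 * 1024   # illustrative limit; a frame must fit here

        def __init__(self):
            # One reusable receive buffer; the transport reads directly
            # into it, so no per-recv() bytes objects are allocated.
            self._buffer = bytearray(4 + self.MAX_FRAME)
            self._filled = 0

        def get_buffer(self):
            # Hand the transport the free tail of the buffer; the transport
            # will call sock.recv_into() on it.
            return memoryview(self._buffer)[self._filled:]

        def buffer_updated(self, nbytes):
            # `nbytes` bytes were received directly into our buffer.
            self._filled += nbytes
            self._process_frames()

        def _process_frames(self):
            while self._filled >= 4:
                msg_len = int.from_bytes(self._buffer[:4], 'little')
                if self._filled - 4 < msg_len:
                    return
                # The payload can be handed out as a zero-copy memoryview.
                self.frame_received(memoryview(self._buffer)[4:4 + msg_len])
                # Shift any bytes of the next frame to the buffer's front.
                rest = self._filled - 4 - msg_len
                self._buffer[:rest] = self._buffer[4 + msg_len:self._filled]
                self._filled = rest

        def frame_received(self, payload):
            pass

Compared to the data_received() sketch earlier, there is no deque, no b''.join(), and no per-chunk bytes objects: the data goes from the socket straight into the protocol's own buffer.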


I've implemented the proposed design in uvloop (branch 'get_buffer', [1]) and adjusted your benchmark [2] to use it.  Here are benchmark results from my machine (macOS):

vanilla asyncio: 120-135 MB/s
uvloop: 320-330 MB/s
uvloop/get_buffer: 600-650 MB/s

The benchmark is quite unstable, but it's clear that the get_buffer() API allows framing to be implemented far more efficiently.


I'm also working on porting the asyncpg library to use get_buffer(), as it has a fairly good benchmark suite.  So far I'm seeing a 5-15% speed boost on all benchmarks.  More importantly, get_buffer() makes asyncpg's buffer implementation simpler!


I'm quite happy with these results, and I propose to implement the get_buffer() API (or its equivalent) in Python 3.7.  I've opened an issue [3] to discuss the implementation details.


[1] https://github.com/MagicStack/uvloop/tree/get_buffer
[2] https://gist.github.com/1st1/1c606e5b83ef0e9c41faf21564d75ad7


Thanks,
Yury

On Oct 18, 2017, 2:31 PM -0400, Antoine Pitrou <solipsis@pitrou.net> wrote:

Hi,

I am currently looking into ways to optimize large data transfers for a
distributed computing framework
(https://github.com/dask/distributed/). We are using Tornado, but the
question is more general, as it turns out that certain kinds of APIs are
an impediment to such optimizations.

To put it briefly, there are a couple of benchmarks discussed here:
https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960

- for Tornado, this benchmark:
https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
- for asyncio, this benchmark:
https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc

Both implement a trivial form of framing using the "preferred" APIs of
each framework (IOStream for Tornado, Protocol for asyncio), and then
benchmark it over 100 MB frames using a simple echo client/server.

The results (on Python 3.6) are interesting:
- vanilla asyncio achieves 350 MB/s
- vanilla Tornado achieves 400 MB/s
- asyncio + uvloop achieves 600 MB/s
- an optimized Tornado IOStream with a more sophisticated buffering
logic (https://github.com/tornadoweb/tornado/pull/2166)
achieves 700 MB/s

The latter result is especially interesting: uvloop uses hand-crafted
Cython code plus the C libuv library, yet a pure-Python version of
Tornado does better thanks to improved buffering logic in the
streaming layer.

Even the Tornado result is not ideal. When profiling, we see that
50% of the runtime is actual IO calls (socket.send and socket.recv),
but the rest is still overhead. In particular, buffering on the read
side still incurs costly memory copies (b''.join calls take 22% of the
time!).

For a framed layer, you shouldn't need so many copies. Once you've
read the frame length, you can allocate the frame upfront and read into
it (see the sketch below). This is at odds, however, with the API
exposed by asyncio's Protocol: data_received() gives you a new bytes
object as soon as data arrives. It's already too late: a spurious
memory copy will have to occur.
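
To make that concrete, here is roughly what a framed read looks like when you control the buffer, sketched with a plain blocking socket and an illustrative 4-byte little-endian length prefix (the helper names are made up):

    import socket


    def read_frame(sock: socket.socket) -> bytearray:
        """Read one length-prefixed frame with a single allocation, no joins."""
        header = bytearray(4)
        _recv_exactly(sock, memoryview(header))
        frame_len = int.from_bytes(header, 'little')
        # Allocate the frame upfront and receive directly into it.
        frame = bytearray(frame_len)
        _recv_exactly(sock, memoryview(frame))
        return frame


    def _recv_exactly(sock: socket.socket, buf: memoryview) -> None:
        # Fill the given memoryview completely using recv_into().
        while buf:
            n = sock.recv_into(buf)
            if n == 0:
                raise ConnectionError("connection closed mid-frame")
            buf = buf[n:]

With data_received() there is no equivalent: by the time the protocol sees the bytes object, the copy has already been made.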

Tornado's IOStream is less constrained, but it supports too many read
schemes (including several types of callbacks). So I crafted a limited
version of IOStream (*) that supports only minimal functionality, but
is able to use socket.recv_into() when asked for a given number of
bytes. When benchmarked, this version achieves 950 MB/s. This is still
without any C code!

(*) see
https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readinto?expand=1

When profiling that limited version of IOStream, we see that 68% of the
runtime is actual IO calls (socket.send and socket.recv_into).
Still, 21% of the total runtime is spent allocating a 100 MB buffer for
each frame! That's 70% of the non-IO overhead! Whether or not there
are smart ways to reuse that writable buffer depends on how the
application intends to use the data: does it throw it away before the
next read or not? It doesn't sound easily doable in the general case.


So I'm wondering what kind of APIs async libraries could expose to
make those use cases faster. I know curio and trio have socket objects
that would probably fit the bill. I don't know if there are
higher-level concepts that would be as adequate for achieving the
highest performance.

Also, since asyncio is the de facto standard now, I wonder if asyncio
might grow such a new API. That may be troublesome: asyncio already
has Protocols and Streams, and people often complain that its
extensive API surface is difficult for beginners :-)


Addendum: asyncio streams
-------------------------

I didn't think asyncio streams would be a good solution, but I still
wrote a benchmark variant for them out of curiosity, and it turns out I
was right. The results:
- vanilla asyncio streams achieve 300 MB/s
- asyncio + uvloop streams achieve 550 MB/s

The benchmark script is at
https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7

Regards

Antoine.

