<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<title></title>

</head>

<body>

<div name="messageBodySection" style="font-size: 14px; font-family: -apple-system, BlinkMacSystemFont, sans-serif;">Hi Antoine,

<div><br /></div>

<div>Thanks for posting this, and sorry for the delayed reply!</div>

<div><br /></div>

<div>I've known for a while that asyncio Protocols could be optimized.  While working on the initial version of uvloop, I noticed that `Protocol.data_received()` requires making one extra copy of the received data.  Back then my main priority was to make uvloop fully compatible with asyncio, so I wasn't really thinking about improving asyncio's design.</div>

<div><br /></div>

<div><br /></div>

<div>Let me explain the current flaw of `Protocol.data_received()` so that other people on the list can catch up with the discussion: <br /></div>

<div><br /></div>

<div>1. Currently, when a Transport is reading data, it calls `sock.recv()`, which returns a `bytes` object that is then passed to `Protocol.data_received()`.  Every call to `sock.recv()` allocates a new bytes object.</div>

<div><br /></div>

<div>2. Typically, protocols need to accumulate the bytes objects they receive until they have enough buffered data to parse.  Usually a `deque` is used for that; less optimized code just concatenates all bytes objects into one.</div>

<div><br /></div>

<div>3. When enough data is gathered and a protocol message can be parsed out of it, usually there's a need to concatenate a few buffers from the `deque` or get a slice of the concatenated buffer.  At this point, we've copied the received data two times.</div>
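<div><br /></div>

<div>The three steps above can be sketched roughly as follows.  This is only an illustration: `AccumulatingProtocol`, its `frame_size` parameter and the fixed-size framing are hypothetical, and the chunks fed to it stand in for what a real transport would get from `sock.recv()`.</div>

<pre>
```python
from collections import deque


class AccumulatingProtocol:
    """Hypothetical fixed-size framing on top of a data_received() API."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.chunks = deque()   # bytes objects as handed over by the transport
        self.buffered = 0       # total number of buffered bytes
        self.frames = []        # complete frames parsed so far

    def data_received(self, data):
        # 'data' is a fresh bytes object: the first allocation and copy
        # already happened in sock.recv() inside the transport.
        self.chunks.append(data)
        self.buffered += len(data)
        while self.buffered >= self.frame_size:
            # Joining the chunks copies the data again, and slicing the
            # joined buffer makes yet another copy of each part.
            joined = b"".join(self.chunks)
            frame = joined[:self.frame_size]
            rest = joined[self.frame_size:]
            self.frames.append(frame)
            self.chunks.clear()
            if rest:
                self.chunks.append(rest)
            self.buffered = len(rest)


proto = AccumulatingProtocol(frame_size=8)
for chunk in (b"abc", b"defgh", b"ij"):   # stand-ins for sock.recv() results
    proto.data_received(chunk)
```
</pre>

<div>After the three calls, one 8-byte frame has been parsed and two bytes remain buffered for the next frame.</div>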

<div><br /></div>

<div><br /></div>

<div>I propose to add another Protocol base class to asyncio: BufferedProtocol.  It won't have the `data_received()` method; instead, it will have `get_buffer()` and `buffer_updated(nbytes)` methods:</div>

<div><br /></div>

<div>    class asyncio.BufferedProtocol:</div>

<div><br /></div>

<div>        def get_buffer(self) -> memoryview:</div>

<div>            pass</div>

<div><br /></div>

<div>        def buffer_updated(self, nbytes: int):</div>

<div>            pass</div>

<div><br /></div>

<div>When the protocol's transport is ready to receive data, it will call `protocol.get_buffer()`.  The latter must return an object that implements the buffer protocol.  The transport will request a writable buffer over the returned object and receive data *into* that buffer.</div>

<div><br /></div>

<div>When the `sock.recv_into(buffer)` call is done, the `protocol.buffer_updated(nbytes)` method will be called.  The number of bytes received into the buffer will be passed as the first argument.</div>
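<div><br /></div>

<div>To make the call sequence concrete, here is a minimal sketch of how a protocol and a transport could cooperate under this design.  It is an assumption-laden toy rather than the uvloop implementation: `FrameReceiver`, `is_complete()` and `pump()` are hypothetical names, the frame size is fixed, and a blocking `socket.socketpair()` loop stands in for the event loop's non-blocking read path.</div>

<pre>
```python
import socket


class FrameReceiver:
    """Hypothetical protocol with the proposed get_buffer() and
    buffer_updated() methods; it collects one fixed-size frame."""

    def __init__(self, frame_size):
        self.frame = bytearray(frame_size)   # allocated once, up front
        self.view = memoryview(self.frame)
        self.received = 0

    def get_buffer(self):
        # Hand out the part of the frame that is still empty.
        return self.view[self.received:]

    def buffer_updated(self, nbytes):
        # The transport reports how many bytes it wrote into the buffer.
        self.received += nbytes

    def is_complete(self):
        return self.received == len(self.frame)


def pump(sock, protocol):
    """Stand-in for the transport's read loop (blocking, for brevity)."""
    while not protocol.is_complete():
        nbytes = sock.recv_into(protocol.get_buffer())
        if nbytes == 0:       # peer closed the connection early
            break
        protocol.buffer_updated(nbytes)


payload = bytes(range(256)) * 4           # a 1024-byte frame
sender, receiver = socket.socketpair()
sender.sendall(payload)
sender.close()

proto = FrameReceiver(len(payload))
pump(receiver, proto)
receiver.close()
```
</pre>

<div>The received bytes land directly in the preallocated frame, so the protocol never has to join or slice intermediate bytes objects.</div>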

<div><br /></div>

<div><br /></div>

<div>I've implemented the proposed design in uvloop (branch 'get_buffer', [1]) and adjusted your benchmark [2] to use it.  Here are benchmark results from my machine (macOS):</div>

<div><br /></div>

<div>vanilla asyncio: 120-135 Mb/s</div>

<div>uvloop: 320-330 Mb/s</div>

<div>uvloop/get_buffer: 600-650 Mb/s.</div>

<div><br /></div>

<div>The benchmark is quite unstable, but it's clear that Protocol.get_buffer() makes it possible to implement framing far more efficiently.<br /></div>

<div><br /></div>

<div><br /></div>

<div>I'm also working on porting the asyncpg library to use get_buffer(), as it has a fairly good benchmark suite.  So far I'm seeing a 5-15% speed boost on all benchmarks.  More importantly, get_buffer() makes asyncpg's buffer implementation simpler!</div>

<div><br /></div>

<div><br /></div>

<div>I'm quite happy with these results and I propose to implement the get_buffer() API (or its equivalent) in Python 3.7.  I've opened an issue [3] to discuss the implementation details.</div>

<div><br /></div>

<div><br /></div>

<div>[1] <a href="https://github.com/MagicStack/uvloop/tree/get_buffer">https://github.com/MagicStack/uvloop/tree/get_buffer</a></div>

<div>[2] <a href="https://gist.github.com/1st1/1c606e5b83ef0e9c41faf21564d75ad7">https://gist.github.com/1st1/1c606e5b83ef0e9c41faf21564d75ad7</a><br /></div>

<div><br /></div>

</div>

<div name="messageSignatureSection" style="font-size: 14px; font-family: -apple-system, BlinkMacSystemFont, sans-serif;"><br />

Thanks,<br />

Yury</div>

<div name="messageReplySection" style="font-size: 14px; font-family: -apple-system, BlinkMacSystemFont, sans-serif;"><br />

On Oct 18, 2017, 2:31 PM -0400, Antoine Pitrou &lt;solipsis@pitrou.net&gt;, wrote:<br />

<blockquote type="cite" style="margin: 5px 5px; padding-left: 10px; border-left: thin solid #1abc9c;"><br />

Hi,<br />

<br />

I am currently looking into ways to optimize large data transfers for a<br />

distributed computing framework<br />

(https://github.com/dask/distributed/). We are using Tornado but the<br />

question is more general, as it turns out that certain kinds of API are<br />

an impediment to such optimizations.<br />

<br />

To put things short, there are a couple benchmarks discussed here:<br />

https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960<br />

<br />

- for Tornado, this benchmark:<br />

https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0<br />

- for asyncio, this benchmark:<br />

https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc<br />

<br />

Both implement a trivial form of framing using the "preferred" APIs of<br />

each framework (IOStream for Tornado, Protocol for asyncio), and then<br />

benchmark it over 100 MB frames using a simple echo client/server.<br />

<br />

The results (on Python 3.6) are interesting:<br />

- vanilla asyncio achieves 350 MB/s<br />

- vanilla Tornado achieves 400 MB/s<br />

- asyncio + uvloop achieves 600 MB/s<br />

- an optimized Tornado IOStream with a more sophisticated buffering<br />

logic (https://github.com/tornadoweb/tornado/pull/2166)<br />

achieves 700 MB/s<br />

<br />

The latter result is especially interesting. uvloop uses hand-crafted<br />

Cython code + the C libuv library, still, a pure Python version of<br />

Tornado does better thanks to an improved buffering logic in the<br />

streaming layer.<br />

<br />

Even the Tornado result is not ideal. When profiling, we see that<br />

50% of the runtime is actual IO calls (socket.send and socket.recv),<br />

but the rest is still overhead. Especially, buffering on the read side<br />

still has costly memory copies (b''.join calls take 22% of the time!).<br />

<br />

For a framed layer, you shouldn't need so many copies. Once you've<br />

read the frame length, you can allocate the frame upfront and read into<br />

it. It is at odds, however, with the API exposed by asyncio's Protocol:<br />

data_received() gives you a new bytes object as soon as data arrives.<br />

It's already too late: a spurious memory copy will have to occur.<br />

<br />

Tornado's IOStream is less constrained, but it supports too many read<br />

schemes (including several types of callbacks). So I crafted a limited<br />

version of IOStream (*) that supports little functionality, but is able<br />

to use socket.recv_into() when asked for a given number of bytes. When<br />

benchmarked, this version achieves 950 MB/s. This is still without C<br />

code!<br />

<br />

(*) see<br />

https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readinto?expand=1<br />

<br />

When profiling that limited version of IOStream, we see that 68% of the<br />

runtime is actual IO calls (socket.send and socket.recv_into).<br />

Still, 21% of the total runtime is spent allocating a 100 MB buffer for<br />

each frame! That's 70% of the non-IO overhead! Whether or not there<br />

are smart ways to reuse that writable buffer depends on how the<br />

application intends to use data: does it throw it away before the next<br />

read or not? It doesn't sound easily doable in the general case.<br />

<br />

<br />

So I'm wondering which kind of APIs async libraries could expose to<br />

make those use cases faster. I know curio and trio have socket objects<br />

which would probably fit the bill. I don't know if there are<br />

higher-level concepts that may be as adequate for achieving the highest<br />

performance.<br />

<br />

Also, since asyncio is the de facto standard now, I wonder if asyncio<br />

might grow such a new API. That may be troublesome: asyncio already<br />

has Protocols and Streams, and people often complain about its<br />

extensive API surface that's difficult for beginners :-)<br />

<br />

<br />

Addendum: asyncio streams<br />

-------------------------<br />

<br />

I didn't think asyncio streams would be a good solution, but I still<br />

wrote a benchmark variant for them out of curiosity, and it turns out I<br />

was right. The results:<br />

- vanilla asyncio streams achieve 300 MB/s<br />

- asyncio + uvloop streams achieve 550 MB/s<br />

<br />

The benchmark script is at<br />

https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7<br />

<br />

Regards<br />

<br />

Antoine.<br />

<br />

<br />

</blockquote>

</div>

</body>

</html>