I adapted this benchmark to Curio using streams and Curio's support for readinto(). The code is at https://gist.github.com/dabeaz/999dc7d08ddd2c0dea790de67948e756

Support for readinto() is fairly recent in Curio, so for testing you will need the latest version from GitHub (https://github.com/dabeaz/curio). Here are the results I got on my machine:

- vanilla asyncio achieves 145 MB/s
- asyncio + uvloop achieves 340 MB/s
- Curio achieves 550 MB/s

The asyncio tests were run using: https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc

Cheers,
Dave
On Oct 18, 2017, at 1:04 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hi,
I am currently looking into ways to optimize large data transfers for a distributed computing framework (https://github.com/dask/distributed/). We are using Tornado but the question is more general, as it turns out that certain kinds of API are an impediment to such optimizations.
To put things short, there are a couple benchmarks discussed here: https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960
- for Tornado, this benchmark: https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
- for asyncio, this benchmark: https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
Both implement a trivial form of framing using the "preferred" APIs of each framework (IOStream for Tornado, Protocol for asyncio), and then benchmark it over 100 MB frames using a simple echo client/server.
The results (on Python 3.6) are interesting:

- vanilla asyncio achieves 350 MB/s
- vanilla Tornado achieves 400 MB/s
- asyncio + uvloop achieves 600 MB/s
- an optimized Tornado IOStream with a more sophisticated buffering logic (https://github.com/tornadoweb/tornado/pull/2166) achieves 700 MB/s
The last result is especially interesting. uvloop uses hand-crafted Cython code plus the C libuv library; still, a pure Python version of Tornado does better thanks to improved buffering logic in the streaming layer.
Even the Tornado result is not ideal. When profiling, we see that 50% of the runtime is actual IO calls (socket.send and socket.recv), but the rest is still overhead. In particular, buffering on the read side still has costly memory copies (b''.join calls take 22% of the time!).
For a framed layer, you shouldn't need so many copies. Once you've read the frame length, you can allocate the frame upfront and read into it. It is at odds, however, with the API exposed by asyncio's Protocol: data_received() gives you a new bytes object as soon as data arrives. It's already too late: a spurious memory copy will have to occur.
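To make the "allocate the frame upfront and read into it" idea concrete, here is a minimal sketch using plain blocking sockets. It is not the code from either benchmark: the 8-byte big-endian length prefix and the helper names (recv_exact_into, read_frame, write_frame) are assumptions chosen for illustration.

```python
import socket
import struct

# Assumed wire format: an 8-byte big-endian length, then the payload.
HEADER = struct.Struct("!Q")


def recv_exact_into(sock, view):
    """Fill the whole memoryview from the socket, with no intermediate bytes objects."""
    pos = 0
    while pos < len(view):
        n = sock.recv_into(view[pos:])
        if n == 0:
            raise ConnectionError("peer closed the connection mid-frame")
        pos += n


def read_frame(sock):
    """Read one length-prefixed frame straight into a preallocated bytearray."""
    header = bytearray(HEADER.size)
    recv_exact_into(sock, memoryview(header))
    (length,) = HEADER.unpack(header)
    frame = bytearray(length)  # allocated once, sized from the header
    recv_exact_into(sock, memoryview(frame))
    return frame


def write_frame(sock, payload):
    """Send the length prefix followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)))
    sock.sendall(payload)
```

Each recv_into() call writes directly into the final buffer, so there is no list of chunks to join afterwards. With a Protocol-style data_received() callback the same framing would have to copy every incoming bytes object into the frame.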
Tornado's IOStream is less constrained, but it supports too many read schemes (including several types of callbacks). So I crafted a limited version of IOStream (*) that supports little functionality, but is able to use socket.recv_into() when asked for a given number of bytes. When benchmarked, this version achieves 950 MB/s. This is still without C code!
(*) see https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readint...
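For readers who want to try the same "read exactly N bytes into a caller-supplied buffer" pattern without Tornado, here is a rough sketch on top of asyncio's low-level socket methods. It is not the code from the branch above. It assumes Python 3.7 or later, where loop.sock_recv_into() is available; read_into_exactly and demo are hypothetical names.

```python
import asyncio
import socket


async def read_into_exactly(loop, sock, buf):
    """Fill the caller-supplied buffer completely from a non-blocking socket."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        # sock_recv_into() writes into the slice in place and returns the byte count.
        n = await loop.sock_recv_into(sock, view[pos:])
        if n == 0:
            raise ConnectionError("peer closed the connection early")
        pos += n
    return buf


async def demo(nbytes=1 << 20):
    """Send nbytes over a socket pair and read them back into one buffer."""
    loop = asyncio.get_running_loop()
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    payload = bytes(i % 251 for i in range(nbytes))
    try:
        buf = bytearray(nbytes)  # the application owns and allocates the buffer
        _, received = await asyncio.gather(
            loop.sock_sendall(a, payload),
            read_into_exactly(loop, b, buf),
        )
    finally:
        a.close()
        b.close()
    return bytes(received) == payload
```

Running asyncio.run(demo()) exercises the round trip. The point of the sketch is only that the caller decides where the bytes land, which is what a Protocol's data_received() does not allow.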
When profiling that limited version of IOStream, we see that 68% of the runtime is actual IO calls (socket.send and socket.recv_into). Still, 21% of the total runtime is spent allocating a 100 MB buffer for each frame! That's 70% of the non-IO overhead! Whether or not there are smart ways to reuse that writable buffer depends on how the application intends to use data: does it throw it away before the next read or not? It doesn't sound easily doable in the general case.
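One possible shape for buffer reuse, under the strong assumption that the application is done with a frame before it asks for the next one, is sketched below. ReusingFrameReader is a hypothetical helper, not part of Tornado, asyncio or Curio, and the 8-byte length prefix is again an assumed wire format.

```python
import socket
import struct

# Assumed wire format: an 8-byte big-endian length, then the payload.
HEADER = struct.Struct("!Q")


class ReusingFrameReader:
    """Hypothetical helper: reads every frame into one buffer that is reused."""

    def __init__(self, sock):
        self._sock = sock
        self._buf = bytearray()

    def _fill(self, view):
        # Loop until the view is full; recv_into() may return fewer bytes than asked.
        pos = 0
        while pos < len(view):
            n = self._sock.recv_into(view[pos:])
            if n == 0:
                raise ConnectionError("peer closed the connection mid-frame")
            pos += n

    def read_frame(self):
        header = bytearray(HEADER.size)
        self._fill(memoryview(header))
        (length,) = HEADER.unpack(header)
        if len(self._buf) < length:
            # Allocate only when the current buffer is too small.
            self._buf = bytearray(length)
        view = memoryview(self._buf)[:length]
        self._fill(view)
        # The returned view is overwritten by the next read_frame() call.
        return view
```

The cost of this design is exactly the caveat raised above: a caller that keeps a frame around after the next read will see its contents change, so this only fits applications that consume or copy each frame immediately.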
So I'm wondering which kind of APIs async libraries could expose to make those use cases faster. I know curio and trio have socket objects which would probably fit the bill. I don't know if there are higher-level concepts that may be as adequate for achieving the highest performance.
Also, since asyncio is the de facto standard now, I wonder if asyncio might grow such a new API. That may be troublesome: asyncio already has Protocols and Streams, and people often complain about its extensive API surface that's difficult for beginners :-)
Addendum: asyncio streams
-------------------------

I didn't think asyncio streams would be a good solution, but I still wrote a benchmark variant for them out of curiosity, and it turns out I was right. The results:

- vanilla asyncio streams achieve 300 MB/s
- asyncio + uvloop streams achieve 550 MB/s
The benchmark script is at https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7
Regards
Antoine.
_______________________________________________
Async-sig mailing list
Async-sig@python.org
https://mail.python.org/mailman/listinfo/async-sig
Code of Conduct: https://www.python.org/psf/codeofconduct/