Re: [Async-sig] APIs for high-bandwidth large I/O?

20 Oct 2017


I adapted this benchmark to Curio using streams and Curio's support for readinto().  Code is at https://gist.github.com/dabeaz/999dc7d08ddd2c0dea790de67948e756
Support for readinto() is somewhat recent in Curio, so for testing you will need the latest version from GitHub (https://github.com/dabeaz/curio).  Here are the results I got on my machine:

- vanilla asyncio achieves 145 MB/s
- asyncio + uvloop achieves 340 MB/s
- Curio achieves 550 MB/s

Asyncio tests were run using:  https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc

Cheers,
Dave
...
On Oct 18, 2017, at 1:04 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hi,
I am currently looking into ways to optimize large data transfers for a
distributed computing framework
(https://github.com/dask/distributed/).  We are using Tornado but the
question is more general, as it turns out that certain kinds of API are
an impediment to such optimizations.
In short, there are a couple of benchmarks discussed here:
https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960
- for Tornado, this benchmark:
https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
- for asyncio, this benchmark:
https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
Both implement a trivial form of framing using the "preferred" APIs of
each framework (IOStream for Tornado, Protocol for asyncio), and then
benchmark it over 100 MB frames using a simple echo client/server.
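To make "a trivial form of framing" concrete, here is a minimal sketch. The 8-byte big-endian length prefix and the helper names encode_frame() and decode_frames() are assumptions for illustration; the gists may use a different wire format.

```python
import struct

# Assumed wire format: 8-byte unsigned big-endian length, then the payload.
HEADER = struct.Struct("!Q")


def encode_frame(payload):
    """Return the payload prefixed with its length (this concatenation copies)."""
    return HEADER.pack(len(payload)) + payload


def decode_frames(buf):
    """Split a byte string into complete frames; return (frames, leftover)."""
    frames = []
    view = memoryview(buf)
    pos = 0
    while len(view) - pos >= HEADER.size:
        (length,) = HEADER.unpack_from(view, pos)
        end = pos + HEADER.size + length
        if end > len(view):
            break  # incomplete frame: wait for more data
        frames.append(bytes(view[pos + HEADER.size:end]))
        pos = end
    return frames, bytes(view[pos:])
```

Note that even this toy decoder copies each payload out of the receive buffer, which is the kind of overhead discussed further down.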
The results (on Python 3.6) are interesting:
- vanilla asyncio achieves 350 MB/s
- vanilla Tornado achieves 400 MB/s
- asyncio + uvloop achieves 600 MB/s
- an optimized Tornado IOStream with a more sophisticated buffering
 logic (https://github.com/tornadoweb/tornado/pull/2166)
 achieves 700 MB/s
The latter result is especially interesting.  uvloop uses hand-crafted
Cython code plus the C libuv library; still, a pure Python version of
Tornado does better thanks to improved buffering logic in the
streaming layer.
Even the Tornado result is not ideal.  When profiling, we see that
50% of the runtime is actual IO calls (socket.send and socket.recv),
but the rest is still overhead.  In particular, buffering on the read side
still has costly memory copies (b''.join calls take 22% of the time!).
For a framed layer, you shouldn't need so many copies.  Once you've
read the frame length, you can allocate the frame upfront and read into
it.  This is at odds, however, with the API exposed by asyncio's Protocol:
data_received() gives you a new bytes object as soon as data arrives.
It's already too late: a spurious memory copy will have to occur.
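A rough sketch of why the Protocol API forces the extra copy. The class name FrameCollector and its attributes are hypothetical; only asyncio.Protocol and data_received() are real asyncio API. The frame length is assumed to be already known.

```python
import asyncio


class FrameCollector(asyncio.Protocol):
    """Hypothetical protocol that assembles one frame of known length."""

    def __init__(self, frame_length):
        self.frame_length = frame_length
        self.chunks = []
        self.received = 0
        self.frame = None

    def data_received(self, data):
        # `data` is a new bytes object handed over by the event loop; the
        # protocol has no way to say where the bytes should be written.
        self.chunks.append(data)
        self.received += len(data)
        if self.received >= self.frame_length:
            # Assembling the frame copies every received byte again.
            joined = b"".join(self.chunks)
            self.frame = joined[:self.frame_length]
            # Keep any bytes that belong to the next frame.
            self.chunks = [joined[self.frame_length:]]
            self.received = len(self.chunks[0])
```

By the time data_received() runs, the bytes already live in their own object, so getting them into a single contiguous frame means copying them once more.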
Tornado's IOStream is less constrained, but it supports too many read
schemes (including several types of callbacks).  So I crafted a limited
version of IOStream (*) that supports little functionality, but is able
to use socket.recv_into() when asked for a given number of bytes.  When
benchmarked, this version achieves 950 MB/s. This is still without C
code!
(*) see
https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readint...
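For illustration, the recv_into() idea looks roughly like this with plain blocking sockets. This is a sketch, not the code from the Tornado branch; the 8-byte length prefix and the names recv_exactly_into() and read_frame() are assumptions.

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # assumed 8-byte big-endian length prefix


def recv_exactly_into(sock, view):
    """Fill `view` completely with recv_into(); no intermediate bytes objects."""
    pos = 0
    while pos < len(view):
        # Slicing a memoryview does not copy the underlying buffer.
        n = sock.recv_into(view[pos:])
        if n == 0:
            raise ConnectionError("peer closed the connection mid-frame")
        pos += n


def read_frame(sock):
    """Read one length-prefixed frame into a buffer allocated upfront."""
    header = bytearray(HEADER.size)
    recv_exactly_into(sock, memoryview(header))
    (length,) = HEADER.unpack(header)
    frame = bytearray(length)  # one allocation for the whole frame
    recv_exactly_into(sock, memoryview(frame))
    return frame
```

The kernel writes directly into the final frame buffer, so no b''.join step is needed on the read side.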
When profiling that limited version of IOStream, we see that 68% of the
runtime is actual IO calls (socket.send and socket.recv_into).
Still, 21% of the total runtime is spent allocating a 100 MB buffer for
each frame!  That's 70% of the non-IO overhead!  Whether or not there
are smart ways to reuse that writable buffer depends on how the
application intends to use data: does it throw it away before the next
read or not?  It doesn't sound easily doable in the general case.
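A minimal sketch of the "throw it away before the next read" case, where reuse is easy. ReusableFrameBuffer is a hypothetical helper, and `stream` is assumed to be any blocking object with a readinto() method (io.BytesIO here, for simplicity).

```python
import io


class ReusableFrameBuffer:
    """Hypothetical helper: one growable buffer shared by successive frames."""

    def __init__(self):
        self._buf = bytearray()

    def read_frame(self, stream, length):
        # Allocate only when a frame is larger than anything seen so far.
        # The buffer is replaced rather than resized, so views handed out
        # earlier stay usable (they just point at the old buffer).
        if len(self._buf) < length:
            self._buf = bytearray(length)
        view = memoryview(self._buf)[:length]
        pos = 0
        while pos < length:
            n = stream.readinto(view[pos:])
            if not n:  # assumes a blocking stream: 0 means end of stream
                raise EOFError("stream ended mid-frame")
            pos += n
        # The returned view is overwritten by the next read_frame() call.
        return view
```

If the application keeps a frame around after the next read, it has to copy it first, which brings the allocation cost back. That is why this does not generalize easily.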
So I'm wondering which kind of APIs async libraries could expose to
make those use cases faster.  I know curio and trio have socket objects
which would probably fit the bill.  I don't know if there are
higher-level concepts that may be as adequate for achieving the highest
performance.
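As a sketch of what a socket-level async API makes possible, here is the same read-into pattern written against asyncio's low-level loop.sock_recv_into(). Note the assumptions: this call and asyncio.run() only exist in Python 3.7 and later, so this is not something the benchmarks above could use on 3.6, and the function names are made up for illustration. It is not curio or trio code.

```python
import asyncio
import socket


async def recv_exactly_into(loop, sock, view):
    """Fill `view` from a non-blocking socket without intermediate copies."""
    pos = 0
    while pos < len(view):
        n = await loop.sock_recv_into(sock, view[pos:])
        if n == 0:
            raise ConnectionError("peer closed the connection mid-frame")
        pos += n


async def transfer_once(payload):
    """Send `payload` over a socket pair and receive it into one buffer."""
    loop = asyncio.get_running_loop()
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    try:
        frame = bytearray(len(payload))  # allocated upfront, length known
        # Send and receive concurrently so a large payload cannot deadlock.
        await asyncio.gather(
            loop.sock_sendall(a, payload),
            recv_exactly_into(loop, b, memoryview(frame)),
        )
        return frame
    finally:
        a.close()
        b.close()
```

Usage would be along the lines of asyncio.run(transfer_once(b"x" * 1000000)); the point is only that the caller chooses the destination buffer, which Protocol.data_received() does not allow.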
Also, since asyncio is the de facto standard now, I wonder if asyncio
might grow such a new API.  That may be troublesome: asyncio already
has Protocols and Streams, and people often complain about its
extensive API surface that's difficult for beginners :-)

Addendum: asyncio streams
-------------------------
I didn't think asyncio streams would be a good solution, but I still
wrote a benchmark variant for them out of curiosity, and it turns out I
was right.  The results:
- vanilla asyncio streams achieve 300 MB/s
- asyncio + uvloop streams achieve 550 MB/s
The benchmark script is at
https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7
Regards
Antoine.

David Beazley