[Async-sig] APIs for high-bandwidth large I/O?

Sat Oct 21 05:53:01 EDT 2017

On Fri, 20 Oct 2017 14:31:22 -0500
David Beazley <dave at dabeaz.com> wrote:
> I adapted this benchmark to Curio using streams and Curio's support for readinto().  Code is at https://gist.github.com/dabeaz/999dc7d08ddd2c0dea790de67948e756
> Support for readinto() is somewhat recent in Curio so for testing, you will need the latest version from Github (https://github.com/dabeaz/curio).  However, here are the results I got on my machine:
> 
> - vanilla asyncio achieves 145 MB/s
> - asyncio + uvloop achieves 340 MB/s
> - Curio achieves 550 MB/s

Thank you Dave!  I ran it on my machine and got roughly 910 MB/s, i.e.
more or less the same speed as a tweaked Tornado using readinto.

I also wrote a variant of the benchmark using socketserver and plain
sockets.  It gets 1150 MB/s, and most of the time seems to be spent in
the kernel, so I'm not quite sure it's possible to improve on that:
https://gist.github.com/pitrou/3ac31e82b4461cbc9b4eee151a47bfee

(note that running the server in a separate process doesn't improve
things; neither does using writev() or sendmsg() to send the two
buffers at once)
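
For readers who want to see the shape of the plain-socket approach, here
is a minimal sketch.  It is not the code from the gist above.  The 8-byte
big-endian length prefix and the names send_frame, recv_frame,
recv_exactly and roundtrip are assumptions made for illustration.  It
sends the header and the payload with one sendmsg() call where possible,
and receives into a preallocated bytearray with recv_into().  Note that
socket.sendmsg() is only available on Unix-like platforms.

```python
import socket
import struct
import threading

HEADER = struct.Struct("!Q")  # assumed framing: 8-byte big-endian length


def send_frame(sock, payload):
    """Send the length header and the payload, ideally in one syscall."""
    buffers = [memoryview(HEADER.pack(len(payload))), memoryview(payload)]
    while buffers:
        sent = sock.sendmsg(buffers)
        # Drop buffers that were sent in full, trim a partially sent one.
        while buffers and sent >= len(buffers[0]):
            sent -= len(buffers[0])
            buffers.pop(0)
        if buffers and sent:
            buffers[0] = buffers[0][sent:]


def recv_exactly(sock, view):
    """Fill `view` completely with recv_into(), with no temporary bytes."""
    while len(view):
        n = sock.recv_into(view)
        if n == 0:
            raise ConnectionError("peer closed the connection mid-frame")
        view = view[n:]


def recv_frame(sock):
    """Read the header, allocate the frame upfront, then read into it."""
    header = bytearray(HEADER.size)
    recv_exactly(sock, memoryview(header))
    (length,) = HEADER.unpack(header)
    frame = bytearray(length)
    recv_exactly(sock, memoryview(frame))
    return frame


def roundtrip(payload):
    """Hypothetical demo helper: pass one frame across a local socket pair."""
    left, right = socket.socketpair()
    with left, right:
        sender = threading.Thread(target=send_frame, args=(left, payload))
        sender.start()
        frame = recv_frame(right)
        sender.join()
    return frame
```

Calling roundtrip(b"x" * 10**6) should return a bytearray equal to the
payload.  The sender runs in a thread only so that a large frame does not
block on a full socket buffer.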

Regards

Antoine.

> 
> Asyncio tests were run using:  https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
> 
> Cheers,
> Dave
> 
> > On Oct 18, 2017, at 1:04 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> > 
> > 
> > Hi,
> > 
> > I am currently looking into ways to optimize large data transfers for a
> > distributed computing framework
> > (https://github.com/dask/distributed/).  We are using Tornado but the
> > question is more general, as it turns out that certain kinds of API are
> > an impediment to such optimizations.
> > 
> > To keep things short, there are a couple of benchmarks discussed here:
> > https://github.com/tornadoweb/tornado/issues/2147#issuecomment-337187960
> > 
> > - for Tornado, this benchmark:
> > https://gist.github.com/pitrou/0f772867008d861c4aa2d2d7b846bbf0
> > - for asyncio, this benchmark:
> > https://gist.github.com/pitrou/719e73c1df51e817d618186833a6e2cc
> > 
> > Both implement a trivial form of framing using the "preferred" APIs of
> > each framework (IOStream for Tornado, Protocol for asyncio), and then
> > benchmark it over 100 MB frames using a simple echo client/server.
> > 
> > The results (on Python 3.6) are interesting:
> > - vanilla asyncio achieves 350 MB/s
> > - vanilla Tornado achieves 400 MB/s
> > - asyncio + uvloop achieves 600 MB/s
> > - an optimized Tornado IOStream with a more sophisticated buffering
> >  logic (https://github.com/tornadoweb/tornado/pull/2166)
> >  achieves 700 MB/s
> > 
> > The latter result is especially interesting.  uvloop uses hand-crafted
> > Cython code + the C libuv library, yet a pure Python version of
> > Tornado does better thanks to improved buffering logic in the
> > streaming layer.
> > 
> > Even the Tornado result is not ideal.  When profiling, we see that
> > 50% of the runtime is actual IO calls (socket.send and socket.recv),
> > but the rest is still overhead.  In particular, buffering on the read
> > side still has costly memory copies (b''.join calls take 22% of the time!).
> > 
> > For a framed layer, you shouldn't need so many copies.  Once you've
> > read the frame length, you can allocate the frame upfront and read
> > directly into it.  That approach is at odds, however, with the API
> > exposed by asyncio's Protocol: data_received() gives you a new bytes
> > object as soon as data arrives.  By then it's already too late: a
> > spurious memory copy will have to occur.
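
To make the constraint concrete, here is a minimal sketch of a
Protocol-style receiver.  It is not taken from any of the benchmarks.
The FrameAssembler class and the 8-byte big-endian length prefix are
assumptions made for illustration.  It allocates each frame upfront and
avoids b''.join, but because data_received() hands over an already
allocated bytes object, one copy per chunk into the frame remains.

```python
import struct

HEADER = struct.Struct("!Q")  # assumed framing: 8-byte big-endian length


class FrameAssembler:
    """Feed it chunks the way a Protocol's data_received() would get them."""

    def __init__(self):
        self._header = bytearray()  # partial header bytes seen so far
        self._frame = None          # preallocated buffer for current frame
        self._filled = 0            # number of payload bytes copied so far
        self.frames = []            # completed frames

    def data_received(self, data):
        view = memoryview(data)
        while len(view):
            if self._frame is None:
                # Still collecting the length header.
                need = HEADER.size - len(self._header)
                self._header += view[:need]
                view = view[need:]
                if len(self._header) < HEADER.size:
                    return
                (length,) = HEADER.unpack(self._header)
                self._header.clear()
                self._frame = bytearray(length)  # allocate the frame upfront
                self._filled = 0
            else:
                # This is the unavoidable copy: bytes object -> frame buffer.
                n = min(len(view), len(self._frame) - self._filled)
                self._frame[self._filled:self._filled + n] = view[:n]
                self._filled += n
                view = view[n:]
            if self._filled == len(self._frame):
                self.frames.append(self._frame)
                self._frame = None
```

With a socket-level API, the slice assignment in the else branch would
be replaced by a recv_into() call on the same buffer, which is the copy
the paragraph above is about.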
> > 
> > Tornado's IOStream is less constrained, but it supports too many read
> > schemes (including several types of callbacks).  So I crafted a limited
> > version of IOStream (*) that supports little functionality, but is able
> > to use socket.recv_into() when asked for a given number of bytes.  When
> > benchmarked, this version achieves 950 MB/s. This is still without C
> > code!
> > 
> > (*) see
> > https://github.com/tornadoweb/tornado/compare/master...pitrou:stream_readinto?expand=1
> > 
> > When profiling that limited version of IOStream, we see that 68% of the
> > runtime is actual IO calls (socket.send and socket.recv_into).
> > Still, 21% of the total runtime is spent allocating a 100 MB buffer for
> > each frame!  That's 70% of the non-IO overhead!  Whether or not there
> > are smart ways to reuse that writable buffer depends on how the
> > application intends to use data: does it throw it away before the next
> > read or not?  It doesn't sound easily doable in the general case.
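
As an illustration of the trade-off, here is a minimal sketch of the
"throw it away before the next read" case.  The ReusableFrameBuffer
class is hypothetical and not part of Tornado or asyncio.  It keeps one
bytearray around and hands out a memoryview of the requested size, so
the allocation cost is only paid when a frame is larger than any seen
before.

```python
class ReusableFrameBuffer:
    """Hand out views onto one long-lived buffer, growing it when needed."""

    def __init__(self):
        self._buf = bytearray()

    def acquire(self, size):
        # Allocate a fresh buffer only if the current one is too small.
        # The old buffer is replaced, not resized, so views handed out
        # earlier stay valid (they keep the old bytearray alive).
        if len(self._buf) < size:
            self._buf = bytearray(size)
        return memoryview(self._buf)[:size]


pool = ReusableFrameBuffer()
frame = pool.acquire(16)       # would be passed to sock.recv_into(frame)
frame[:] = b"a" * 16
next_frame = pool.acquire(8)   # same memory: the previous frame is clobbered
```

The last two lines show why this is only safe when the application is
done with a frame before asking for the next one: both views alias the
same underlying memory.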
> > 
> > 
> > So I'm wondering which kind of APIs async libraries could expose to
> > make those use cases faster.  I know curio and trio have socket objects
> > which would probably fit the bill.  I don't know if there are
> > higher-level concepts that may be as adequate for achieving the highest
> > performance.
> > 
> > Also, since asyncio is the de facto standard now, I wonder if asyncio
> > might grow such a new API.  That may be troublesome: asyncio already
> > has Protocols and Streams, and people often complain about its
> > extensive API surface that's difficult for beginners :-)
> > 
> > 
> > Addendum: asyncio streams
> > -------------------------
> > 
> > I didn't think asyncio streams would be a good solution, but I still
> > wrote a benchmark variant for them out of curiosity, and it turns out I
> > was right.  The results:
> > - vanilla asyncio streams achieve 300 MB/s
> > - asyncio + uvloop streams achieve 550 MB/s
> > 
> > The benchmark script is at
> > https://gist.github.com/pitrou/202221ca9c9c74c0b48373ac89e15fd7
> > 
> > Regards
> > 
> > Antoine.
> > 
> > 
> > _______________________________________________
> > Async-sig mailing list
> > Async-sig at python.org
> > https://mail.python.org/mailman/listinfo/async-sig
> > Code of Conduct: https://www.python.org/psf/codeofconduct/  