2014-11-27 1:18 GMT+00:00 Trent Nelson <trent@snakebite.org>:
Everything else is just normal Python, nothing special -- it just conforms to the current constraints of PyParallel. Basically, the HttpServer.data_received() method will be invoked from parallel threads, not the main interpreter thread.
So, still no garbage collection from the threads?
To give you an idea how the protocol/transport stuff is wired up, the standalone launcher stuff is at the bottom of that file:
import socket
import async   # PyParallel's async module (part of this branch, not stdlib)

# Bind to the machine's primary IP and listen on port 8080.
ipaddr = socket.gethostbyname(socket.gethostname())
server = async.server(ipaddr, 8080)

# Tie the protocol class to the listening transport, then hand
# control over to PyParallel's event loop.
async.register(transport=server, protocol=HttpServer)
async.run()
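And a minimal sketch of the shape of such a protocol class, purely for illustration -- the data_received() signature and the return-bytes-to-send convention here are assumptions, not necessarily the real interface, so check the repo for the actual details:

class HttpServer:
    # Invoked from a parallel thread (not the main interpreter thread)
    # for each chunk of data received on a connection.
    def data_received(self, transport, data):
        body = b'Hello, PyParallel!\r\n'
        header = ('HTTP/1.1 200 OK\r\n'
                  'Content-Type: text/plain\r\n'
                  'Content-Length: %d\r\n'
                  '\r\n' % len(body)).encode('ascii')
        # Assumption: bytes returned from data_received() are sent
        # back to the client by the transport.
        return header + body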
As for why I haven't publicized this stuff until now, to quote myself from that video... "It currently crashes, a lot. I know why it's crashing, I just haven't had the time to fix it yet. But hey, it's super fast until it does crash ;-)"
By crash, I mean I'm hitting an assert() in my code -- it happens after the benchmark runs and has to do with the asynchronous socket disconnect logic. I tried fixing it properly before giving that talk, but ran out of time (https://bitbucket.org/tpn/pyparallel/branch/3.3-px-pygotham-2014-sprint).
I'll fix all of that the next sprint... which will be... heh, hopefully around Christmas?
Oh, actually, the big takeaway from the PyGotham sprint was that I spent an evening re-applying all the wild commits and hackery I'd accumulated to a branch created from the 3.3.5 tag: https://bitbucket.org/tpn/pyparallel/commits/branch/3.3-px. So diff that against the 3.3.5 tag to get an idea of what interpreter changes I needed to make to get to this point. (I have no idea why I didn't pick a tag to work off when I first started -- I literally just started hacking on whatever my local tip was at, which was some indeterminate state between... 3.2 and 3.3?)
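To actually produce that diff, something like this should work against the Mercurial repo (untested here; v3.3.5 is CPython's tag name for that release):

hg diff -r v3.3.5 -r 3.3-px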
Side note: I'm really happy with how everything has worked out so far, it is exactly how I envisioned it way back in those python-ideas@ discussions that resulted in tulip/asyncio. I was seeing ridiculously good scaling on my beefier machine at home (8 core, running native) -- to the point where I was maxing out the client machine at about 50,000 requests/sec (~100MB/s) and the PyParallel box was only at about 40% CPU use.
Oh, and it appears to be much faster than node.js's http-server too (`npm install http-server`, cd into the website directory, then `http-server -s .` to get an equivalent HTTP server from node.js), which I thought was cute. Well, I expected it to be -- being able to exploit all cores instead of doing single-threaded multiplexing is the whole point -- so it was good to see that being the case.
Node wasn't actually that much faster than Python's normal http.server if I remember correctly. It definitely used less CPU overall than the Python one -- basically what I'm seeing is that Python will be maxing out one core, which should only be 25% CPU (4 core VM), but actual CPU use is up around 50%, and it's mostly kernel time making up the other half. Node will also max out a core, but overall CPU use is ~30%. I attribute this to Python's http.server using select(), whereas I believe node.js ends up using IOCP in a single-threaded event loop. So, you could expect Python asyncio to get similar performance to node, but they're both crushed by PyParallel (until it crashes, heh) as soon as you've got more than one core, which was the point I've been vehemently making from day one ;-)
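For anyone who hasn't stared at that model before, single-threaded readiness multiplexing -- what http.server's select() loop boils down to, and roughly the shape of node's event loop, IOCP plumbing aside -- looks like this generic stdlib sketch (illustrative only, not the actual implementation of either):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8000))
server.listen(128)

sockets = [server]
while True:
    # One thread asks "which sockets are ready?" and then performs all
    # the I/O itself -- only one core ever does work in this model.
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is server:
            conn, _ = server.accept()
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if data:
                s.sendall(data)   # echo; a real server would parse HTTP here
            else:
                sockets.remove(s)
                s.close()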
And I just realized I'm writing this e-mail on the same laptop that did that demo, so I can actually back all of this up with a quick run now.
Python 3.3
On Windows:
C:\Users\Trent\src\pyparallel-0.1-3.3.5
λ python33-http-server.bat
Serving HTTP on 0.0.0.0 port 8000 ...
On Mac:
(trent@raptor:ttys003) (Wed/19:06) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency http://192.168.46.131:8000/index.html
Running 10s test @ http://192.168.46.131:8000/index.html
2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.33ms    1.74ms  18.42ms   75.65%
    Req/Sec    419.77    119.93   846.00    67.43%
Latency Distribution
50% 6.26ms
75% 7.15ms
90% 8.21ms
99% 12.42ms
8100 requests in 10.00s, 53.48MB read
Requests/sec: 809.92
Transfer/sec: 5.35MB
Node.js
On Windows:
C:\Users\Trent\src\pyparallel-0.1-3.3.5\website
λ http-server -s .
On Mac:
(trent@raptor:ttys003) (Wed/19:07) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency http://192.168.46.131:8080/index.html
Running 10s test @ http://192.168.46.131:8080/index.html
2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.44ms    2.40ms  19.70ms   84.77%
    Req/Sec    621.94    124.26     0.94k   68.05%
Latency Distribution
50% 5.93ms
75% 7.00ms
90% 8.97ms
99% 16.17ms
12021 requests in 10.00s, 80.84MB read
Requests/sec: 1201.98
Transfer/sec: 8.08MB
PyParallel
On Windows:
C:\Users\Trent\src\pyparallel-0.1-3.3.5
λ pyparallel-http-server.bat
Serving HTTP on 192.168.46.131 port 8080 ...
Traceback (most recent call last):
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 518, in <module>
    cli = run(*args)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 488, in run
    return CLI(*args, **kwds)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 272, in __init__
    self.run()
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 278, in run
    self._process_commandline()
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 424, in _process_commandline
    cl.run(args)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\cli.py", line 217, in run
    self.command.start()
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\command.py", line 455, in start
    self.run()
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\px\commands.py", line 90, in run
    async.run()
OSError: [WinError 8] Not enough storage is available to process this command
_PyParallel_Finalize(): px->contexts_active: 462
[92105 refs]
_PyParallel_DeletingThreadState(): px->contexts_active: 462
Oh dear :-) Hadn't seen that before. The VM has 4GB allocated to it... I checked taskmgr and it was reporting ~90% physical memory use. I closed a bunch of things, got it down to 54%, then re-ran; that did the trick. Including this info in case anyone else runs into it.
Re-run:
C:\Users\Trent\src\pyparallel-0.1-3.3.5
λ pyparallel-http-server.bat
Serving HTTP on 192.168.46.131 port 8080 ...
On Mac:
(trent@raptor:ttys003) (Wed/19:16) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency http://192.168.46.131:8080/index.html
Running 10s test @ http://192.168.46.131:8080/index.html
2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.04ms    1.80ms  23.35ms   91.16%
    Req/Sec      1.07k   191.81     1.54k   75.00%
Latency Distribution
50% 3.68ms
75% 4.41ms
90% 5.40ms
99% 13.04ms
20317 requests in 10.00s, 134.22MB read
Requests/sec: 2031.33
Transfer/sec: 13.42MB
And then back on Windows after the benchmark completes:
C:\Users\Trent\src\pyparallel-0.1-3.3.5
λ pyparallel-http-server.bat
Serving HTTP on 192.168.46.131 port 8080 ...
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Heh. That's the crashing I was referring to.
So basically, it's better in every category (lowest latency, lowest jitter (stddev), highest throughput) for the duration of the benchmark, then crashes :-)
(Basically, my DisconnectEx assumptions regarding overlapped sockets, socket resource reuse, I/O completion ports, and thread pools were... not correct apparently.)
I remain committed to the assertion that the Windows kernel's approach to asynchronous I/O via I/O completion ports is fundamentally superior to the UNIX/POSIX approach in every respect if you want to use contemporary multicore hardware optimally (i.e. exploit all cores as efficiently as possible). I talk about this in more detail here: https://speakerdeck.com/trent/parallelism-and-concurrency-with-python?slide=....
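To make the contrast with the select() loop above concrete: in the completion model you never ask "which sockets are ready?" -- you park a pool of threads on a single completion port and the kernel hands each finished I/O to whichever thread (and so whichever core) is available. Here's a crude, portable sketch of that shape, with queue.Queue standing in for the completion port and the producer standing in for the kernel -- this is the pattern, emphatically not IOCP itself, and in stock CPython the GIL would serialize these workers, which is exactly the serialization PyParallel removes:

import queue
import threading

completion_port = queue.Queue()   # stand-in for a real I/O completion port

def worker():
    # Each worker blocks on the "port"; one thread is woken per
    # completed I/O, so work spreads across however many cores exist.
    while True:
        op, data = completion_port.get()
        if op is None:
            break
        print('completed %s: %r' % (op, data))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

# In the real version the kernel posts these when I/O finishes:
completion_port.put(('recv', b'GET /index.html HTTP/1.1\r\n\r\n'))
completion_port.put(('send', b'HTTP/1.1 200 OK\r\n\r\n'))

for _ in threads:
    completion_port.put((None, None))
for t in threads:
    t.join()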
But good grief, it is orders of magnitude more complex at every level. A less stubborn version of me would have given up waaaay earlier. Glad I stuck with it, though; really happy with the results so far.
Trent.
From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum
Sent: Wednesday, November 26, 2014 4:49 PM
To: Trent Nelson
Cc: Paul Colomiets; python-ideas
Subject: Re: [Python-ideas] Asynchronous IO ideas for Python
Trent,
Can you post source for the regular and pyparallel HTTP servers you used?
On Wed, Nov 26, 2014 at 12:56 PM, Trent Nelson <trent@snakebite.org> wrote:
Relevant part of the video with the normal Python stats on the left and PyParallel on the right:
https://www.youtube.com/watch?v=4L4Ww3ROuro#t=838
Transcribed stats:
Regular Python HTTP server:
Thread Stats   Avg     Stdev   Max
  Latency      4.93ms  714us   10ms
  Req/Sec      552     154     1.1k
10,480 requests in 10s, 69MB
1048 reqs/sec, 6.9MB/s
PyParallel (4 core Windows VM):
Thread Stats   Avg     Stdev   Max
  Latency      2.41ms  531us   10ms
  Req/Sec      1.74k   183     2.33k
32,831 requests in 10s, 216MB
3263 reqs/sec, 21MB/s
So, roughly 3x the throughput (3263 vs 1048 reqs/sec) on 4 cores -- a bit less than linear scaling, which isn't too bad for a full debug build running in a VM.
-----Original Message-----
From: Trent Nelson
Sent: Wednesday, November 26, 2014 3:36 PM
To: 'Paul Colomiets'; python-ideas
Subject: RE: [Python-ideas] Asynchronous IO ideas for Python
Have you seen this?
https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite...
I spend the first 80-ish slides on async I/O.
(That was a year ago. I've done 2-3 sprints on it since then and have gotten it to a point where I can back up the claims with hard numbers on load testing benchmarks, demonstrated in the most recent video: https://www.youtube.com/watch?v=4L4Ww3ROuro.)
Trent.
-----Original Message-----
From: Python-ideas [mailto:python-ideas-bounces+trent=snakebite.org@python.org] On Behalf Of Paul Colomiets
Sent: Wednesday, November 26, 2014 12:35 PM
To: python-ideas
Subject: [Python-ideas] Asynchronous IO ideas for Python
Hi,
I've written an article about how I perceive the future of asynchronous I/O in Python. It's not something that should be directly incorporated into Python now, but I believe it's useful for the python-ideas list.
https://medium.com/@paulcolomiets/the-future-of-asynchronous-io-in-python-ce...
And a place for comments at Hacker News:
https://news.ycombinator.com/item?id=8662782
I hope this writeup is helpful :)
--
Paul
--
--Guido van Rossum (python.org/~guido)