Sure can!  I dumped the entire contents of my PyParallel source repository (including the full build .exes/dlls, Visual Studio files, etc) to here about an hour before I gave that presentation:


So to replicate that test you'd clone that, then run one of the helpers, python33-http-server.bat or pyparallel-http-server.bat.


(There's also, but it's 117MB, so... I'd recommend


The Python 3.3 version is literally `python -m http.server`; the only change I made to the stdlib http/ is this:


% diff -u cpython.hg/Lib/http/
--- cpython.hg/Lib/http/       2014-08-14 16:00:29.000000000 -0400
+++       2014-11-26 18:12:58.000000000 -0500
@@ -328,7 +328,7 @@
         conntype = self.headers.get('Connection', "")
         if conntype.lower() == 'close':
             self.close_connection = 1
-        elif (conntype.lower() == 'keep-alive' and
+        elif (conntype.lower() == 'keep-alive' or
               self.protocol_version >= "HTTP/1.1"):
             self.close_connection = 0
         # Examine the headers and look for an Expect directive
@@ -440,7 +440,7 @@
         version and the current date.

-        self.log_request(code)
+        #self.log_request(code)
         self.send_response_only(code, message)
         self.send_header('Server', self.version_string())
         self.send_header('Date', self.date_time_string())
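The effect of the first hunk can be sketched as a standalone predicate (a hypothetical helper for illustration, not code from the stdlib):

```python
def should_close(conn_header, protocol_version):
    """Sketch of the connection-close decision after the patch: keep the
    connection open if the client sent Connection: keep-alive OR speaks
    HTTP/1.1+.  Stock http.server requires both conditions (the 'and'
    removed by the diff above)."""
    conntype = conn_header.lower()
    if conntype == 'close':
        return True
    if conntype == 'keep-alive' or protocol_version >= "HTTP/1.1":
        return False
    return True
```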


The PyParallel version is basically `python_d -m async.http.server`, which is this:


I started with the stdlib version and mostly refactored it a bit for personal style reasons, then made it work for PyParallel.  The only PyParallel-specific piece is actually this line:



return response.transport.sendfile(before, path, None)


Everything else is just normal Python, nothing special -- it just conforms to the current constraints of PyParallel.  Basically, the HttpServer.data_received() method will be invoked from parallel threads, not the main interpreter thread.
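To make that constraint concrete, here's a hypothetical minimal protocol in the same shape (the names and signature here are my assumption for illustration, not the actual async.http.server code): the callback only touches its arguments and locals, which is what makes it safe to invoke from any parallel thread.

```python
class EchoUpper:
    """Hypothetical PyParallel-style protocol sketch.  data_received()
    mutates no instance or global state; it works purely on its inputs
    and hands the result back through the transport."""
    def data_received(self, transport, data):
        response = data.upper()          # purely local computation
        return transport.send(response)  # reply via the transport
```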


To give you an idea how the protocol/transport stuff is wired up, the standalone launcher stuff is at the bottom of that file:


ipaddr = socket.gethostbyname(socket.gethostname())
server = async.server(ipaddr, 8080)
async.register(transport=server, protocol=HttpServer)




As for why I haven't publicized this stuff until now, to quote myself from that video... "It currently crashes, a lot.  I know why it's crashing, I just haven't had the time to fix it yet.  But hey, it's super fast until it does crash ;-)"


By crash, I mean I'm hitting an assert() in my code -- it happens after the benchmark runs and has to do with the asynchronous socket disconnect logic.  I tried fixing it properly before giving that talk, but ran out of time (


I'll fix all of that the next sprint... which will be... heh, hopefully around Christmas?


Oh, actually, the big takeaway from the PyGotham sprint was that I spent an evening re-applying all the wild commit and hackery I'd accumulated to a branch created from the 3.3.5 tag:  So diff that against the 3.3.5 tag to get an idea of what interpreter changes I needed to make to get to this point.  (I have no idea why I didn't pick a tag to work off when I first started -- I literally just started hacking on whatever my local tip was on, which was some indeterminate state between... 3.2 and 3.3?)


Side note: I'm really happy with how everything has worked out so far, it is exactly how I envisioned it way back in those python-ideas@ discussions that resulted in tulip/asyncio.  I was seeing ridiculously good scaling on my beefier machine at home (8 core, running native) -- to the point where I was maxing out the client machine at about 50,000 requests/sec (~100MB/s) and the PyParallel box was only at about 40% CPU use.


Oh, and it appears to be much faster than node.js's http-server too (`npm install http-server`, cd into the website directory, `http-server -s .` to get an equivalent HTTP server from node.js), which I thought was cute.  Well, I expected it to be, that's the whole point of being able to exploit all cores and not doing single threaded multiplexing -- so it was good to see that being the case.


Node wasn't actually that much faster than Python's normal http.server if I remember correctly.  It definitely used less CPU overall than the Python one -- basically what I'm seeing is that Python will be maxing out one core, which should only be 25% CPU (4 core VM), but actual CPU use is up around 50%, and it's mostly kernel time making up the other half.  Node will also max out a core, but overall CPU use is ~30%.  I attribute this to Python's http.server using select(), whereas I believe node.js ends up using IOCP in a single-threaded event loop.  So, you could expect Python asyncio to get similar performance to node, but they're both crushed by PyParallel (until it crashes, heh) as soon as you've got more than one core, which was the point I've been vehemently making from day one ;-)
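For contrast, the select()-based multiplexing that http.server relies on looks roughly like this (a simplified sketch, not the stdlib's actual loop): a single thread polls readiness and then does every accept, recv, and send itself, so one core is the ceiling no matter how many clients connect.

```python
import select
import socket

def select_loop(server_sock, handle, max_iters=200):
    """Single-threaded readiness loop.  All I/O happens on this one
    thread, which is why the Python process maxes out exactly one core."""
    conns = []
    for _ in range(max_iters):
        readable, _, _ = select.select([server_sock] + conns, [], [], 0.01)
        for s in readable:
            if s is server_sock:
                conn, _addr = s.accept()   # new client connection
                conns.append(conn)
            else:
                data = s.recv(4096)
                if not data:               # peer closed the connection
                    conns.remove(s)
                    s.close()
                else:
                    s.sendall(handle(data))
```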


And I just realized I'm writing this e-mail on the same laptop that did that demo, so I can actually back all of this up with a quick run now.


Python 3.3

On Windows:

λ python33-http-server.bat
Serving HTTP on port 8000 ...

On Mac:


(trent@raptor:ttys003) (Wed/19:06) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency
Running 10s test @
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.33ms    1.74ms   18.42ms   75.65%
    Req/Sec   419.77    119.93    846.00     67.43%
  Latency Distribution
     50%    6.26ms
     75%    7.15ms
     90%    8.21ms
     99%   12.42ms
  8100 requests in 10.00s, 53.48MB read
Requests/sec:    809.92
Transfer/sec:      5.35MB



node.js

On Windows:


λ http-server -s .


On Mac:

(trent@raptor:ttys003) (Wed/19:07) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency
Running 10s test @
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.44ms    2.40ms   19.70ms   84.77%
    Req/Sec   621.94    124.26      0.94k    68.05%
  Latency Distribution
     50%    5.93ms
     75%    7.00ms
     90%    8.97ms
     99%   16.17ms
  12021 requests in 10.00s, 80.84MB read
Requests/sec:   1201.98
Transfer/sec:      8.08MB




PyParallel

On Windows:


λ pyparallel-http-server.bat
Serving HTTP on port 8080 ...
Traceback (most recent call last):
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\", line 73, in _run_code
    exec(code, run_globals)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 518, in <module>
    cli = run(*args)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 488, in run
    return CLI(*args, **kwds)
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 272, in __init__
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 278, in run
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 424, in _process_commandline
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 217, in run
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\ctk\", line 455, in start
  File "C:\Users\Trent\src\pyparallel-0.1-3.3.5\Lib\px\", line 90, in run
OSError: [WinError 8] Not enough storage is available to process this command
_PyParallel_Finalize(): px->contexts_active: 462
[92105 refs]
_PyParallel_DeletingThreadState(): px->contexts_active: 462


Oh dear :-)  Hadn't seen that before.  The VM has 4GB allocated to it... I checked taskmgr and it was reporting ~90% physical memory use.  Closed a bunch of things and got it down to 54%, then re-ran, that did the trick.  Including this info in case anyone else runs into this.



λ pyparallel-http-server.bat
Serving HTTP on port 8080 ...


On Mac:

(trent@raptor:ttys003) (Wed/19:16) .. (~s/wrk)
% ./wrk -c 8 -t 2 -d 10 --latency
Running 10s test @
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.04ms    1.80ms   23.35ms   91.16%
    Req/Sec     1.07k   191.81      1.54k    75.00%
  Latency Distribution
     50%    3.68ms
     75%    4.41ms
     90%    5.40ms
     99%   13.04ms
  20317 requests in 10.00s, 134.22MB read
Requests/sec:   2031.33
Transfer/sec:     13.42MB


And then back on Windows after the benchmark completes:


λ pyparallel-http-server.bat
Serving HTTP on port 8080 ...
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311
Assertion failed: s->io_op == PxSocket_IO_SEND, file ..\Python\pyparallel.c, line 6311

Heh.  That's the crashing I was referring to.


So basically, it's better in every category (lowest latency, lowest jitter (stddev), highest throughput) for the duration of the benchmark, then crashes :-)


(Basically, my DisconnectEx assumptions regarding overlapped sockets, socket resource reuse, I/O completion ports, and thread pools were... not correct apparently.)


I remain committed to the assertion that the Windows kernel's approach to asynchronous I/O via I/O completion ports is fundamentally superior to the UNIX/POSIX approach in every respect if you want to make optimal use of contemporary multicore hardware (exploiting all cores as efficiently as possible).  I talk about this in more detail here:


But good grief, it is orders of magnitude more complex at every level.  A less stubborn version of me would have given up waaaay earlier.  Glad I stuck with it, though; I'm really happy with the results so far.






From: [] On Behalf Of Guido van Rossum
Sent: Wednesday, November 26, 2014 4:49 PM
To: Trent Nelson
Cc: Paul Colomiets; python-ideas
Subject: Re: [Python-ideas] Asynchronous IO ideas for Python



Can you post source for the regular and pyparallel HTTP servers you used?


On Wed, Nov 26, 2014 at 12:56 PM, Trent Nelson <> wrote:

Relevant part of the video with the normal Python stats on the left and PyParallel on the right:

Transcribed stats:

Regular Python HTTP server:

Thread Stats    Avg     Stdev   Max
 Latency        4.93ms  714us   10ms
 Req/Sec        552     154     1.1k
10,480 requests in 10s, 69MB
1048 reqs/sec, 6.9MB/s

PyParallel (4 core Windows VM):

Thread Stats    Avg     Stdev   Max
 Latency        2.41ms  531us   10ms
 Req/Sec        1.74k   183     2.33k
32,831 requests in 10s, 216MB
3263 reqs/sec, 21MB/s

So basically a bit less than linear scaling with more cores, which isn't too bad for a full debug build running on a VM.

-----Original Message-----
From: Trent Nelson
Sent: Wednesday, November 26, 2014 3:36 PM
To: 'Paul Colomiets'; python-ideas
Subject: RE: [Python-ideas] Asynchronous IO ideas for Python

Have you seen this?:

I spend the first 80-ish slides on async I/O.

(That was a year ago.  I've done 2-3 sprints on it since then and have gotten it to a point where I can back up the claims with hard numbers on load testing benchmarks, demonstrated in the most recent video:


-----Original Message-----
From: Python-ideas [] On Behalf Of Paul Colomiets
Sent: Wednesday, November 26, 2014 12:35 PM
To: python-ideas
Subject: [Python-ideas] Asynchronous IO ideas for Python


I've written an article about how I perceive the future of asynchronous I/O in Python. It's not something that should directly be incorporated into python now, but I believe it's useful for python-ideas list.

And a place for comments at Hacker News:

I hope this writeup is helpful :)

Python-ideas mailing list
Code of Conduct:


--Guido van Rossum (