While I was implementing JSON-JWS (JSON web signatures), a format
which in Python 3 has to go from bytes > unicode > bytes > unicode
several times in its construction, I noticed I wrote a lot of bugs:
When I meant to say:
Everything worked perfectly on Python 3 because the verifying code
also generated the sha256=b'abcdef1234' as a comparison. I would have
never noticed at all unless I had tried to verify the Python 3 output
with Python 2.
I know I'm a bad person for not having unit tests capable enough to
catch this bug, a bug I wrote repeatedly in each layer of the bytes >
unicode > bytes > unicode dance, and that there is no excuse for being
confused at any time about the type of a variable, but I'm not willing
Instead, I would like a new string formatting operator tentatively
called 'notbytes': "sha256=%notbytes" % (b'abcdef1234'). It gives the
same error as 'sha256='+b'abc1234' would: TypeError: Can't convert
'bytes' object to str implicitly
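To make the failure mode concrete, here is a small sketch of the silent bug next to a strict helper that behaves roughly the way the proposed operator would. `strict_format` is a hypothetical name used only for illustration; it is not an existing or proposed standard-library function.

```python
def strict_format(template, *args):
    """Format like the % operator, but refuse bytes arguments.

    Hypothetical helper approximating the proposed '%notbytes' behaviour.
    """
    for arg in args:
        if isinstance(arg, (bytes, bytearray)):
            raise TypeError(
                "Can't convert %r object to str implicitly" % type(arg).__name__
            )
    return template % args


digest = b"abcdef1234"

# The silent bug: on Python 3, %s falls back to the repr of the bytes object.
buggy = "sha256=%s" % digest

# What was meant: decode explicitly before formatting.
intended = "sha256=%s" % digest.decode("ascii")
```

With this helper, `strict_format("sha256=%s", digest)` raises TypeError instead of quietly embedding `b'...'` in the output.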
Just an idea for a usability fix for Python 3.
hexdump module (function or bytes method is better) as a simple, easy
and intuitive way for dumping binary data when writing programs in
hexdump(bytes) - produce human readable dump of binary data,
byte-by-byte representation, separated by space, 16-byte rows
Generic binary data can't be output to console. A separate helper
is needed to print, log or store its value in human readable format in
database. This takes time.
binascii is ugly: name is not intuitive any more, there are a lot
of functions, and it is not clear how it relates to unicode.
It is convenient to have format that can be displayed in a text
editor. Simple tools encourage people to use them.
>>> data = hexdump(b)
E6 B0 08 04 E7 9E 08 04 E7 BC 08 04 E7 D5 08 04
E7 E4 08 04 E6 B0 08 04 E7 F0 08 04 E7 FF 08 04
E8 0B 08 04 E8 1A 08 04 E6 B0 08 04 E6 B0 08 04
>>> # achieving the same output with binascii is overcomplicated
>>> data_lines = [binascii.hexlify(b)[i:min(i+32, len(binascii.hexlify(b)))] for i in xrange(0, len(binascii.hexlify(b)), 32)]
>>> data_lines = [' '.join(l[i:min(i+2, len(l))] for i in xrange(0, len(l), 2)).upper() for l in data_lines]
E6 B0 08 04 E7 9E 08 04 E7 BC 08 04 E7 D5 08 04
E7 E4 08 04 E6 B0 08 04 E7 F0 08 04 E7 FF 08 04
E8 0B 08 04 E8 1A 08 04 E6 B0 08 04 E6 B0 08 04
On the other side, getting rather useless binascii output from
hexdump() is quite trivial:
>>> data.replace(' ','').replace('\n','').lower()
But more practical, for example, would be counting offset from hexdump:
>>> print( ''.join( '%05x: %s\n' % (i*16,l) for i,l in enumerate(hexdump(b).split('\n'))))
By providing better building blocks on basic level Python will become
a better tool for more useful tasks.
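For reference, a minimal sketch of what such a hexdump() could look like in Python 3, following the description above (uppercase hex, space-separated, 16 bytes per row). The `width` parameter and the `hexdump_with_offsets` helper are assumptions added for illustration, not part of the proposal.

```python
def hexdump(data, width=16):
    """Return an uppercase hex dump of `data`, `width` bytes per row."""
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        # Iterating over bytes yields ints on Python 3.
        rows.append(" ".join("%02X" % byte for byte in chunk))
    return "\n".join(rows)


def hexdump_with_offsets(data, width=16):
    """Prefix every row of the dump with its hexadecimal byte offset."""
    lines = hexdump(data, width).split("\n") if data else []
    return "\n".join(
        "%05x: %s" % (index * width, line) for index, line in enumerate(lines)
    )
```

Stripping the spaces and newlines and lowercasing the result gives back the plain hex string, as in the one-liner shown earlier.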
Work priorities don't allow me to spend another day replying in detail
to the various emails on this topic, but I am still keeping up
I have read Greg's response to my comparison between
Future+yield-based coroutines and his yield-from-based, Future-free
coroutines, and after having written a small prototype, I am now
pretty much convinced that Greg's way is superior. This doesn't mean
you can't use generators or yield-from for other purposes! It's just
that *if* you are writing a coroutine for use with a certain scheduler,
you must use yield and yield-from in accordance with the scheduler's
rules. However, code you call can still use yield and yield-from for
iteration, and you can still use for-loops. In particular, if f is a
coroutine, it can still write "for x in g(): ..." where g is a
generator meant to be an iterator. However if g were instead a
coroutine, f should call it using "yield from g()", and f and g should
agree on the interface of their scheduler.
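A toy illustration of that distinction, assuming a deliberately trivial scheduler convention (a bare `yield` means "let something else run"). The names `countdown`, `child`, `parent` and `run` are made up for this sketch and do not come from any of the prototypes under discussion.

```python
def countdown(n):
    """Plain generator used as an iterator; the scheduler never sees it."""
    while n > 0:
        yield n
        n -= 1


def child(log):
    """Coroutine: a bare yield hands control back to the scheduler."""
    log.append("child start")
    yield
    log.append("child end")
    return 42


def parent(log):
    """Coroutine that both iterates over a generator and calls a coroutine."""
    for x in countdown(2):          # ordinary iteration, no scheduler involved
        log.append(x)
    result = yield from child(log)  # coroutine call, goes through the scheduler
    log.append(result)


def run(coro):
    """Toy scheduler: resume a single coroutine until it finishes."""
    switches = 0
    try:
        while True:
            next(coro)
            switches += 1
    except StopIteration:
        pass
    return switches
```

Only the `yield` inside `child` reaches the scheduler; the values produced by `countdown` are consumed by the for-loop in `parent` and never leave it.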
As to other topics, my current feeling is that we should try to
separately develop requirements and prototype implementations of the
I/O loop of the future, and to figure the loosest possible coupling
between that and a coroutine scheduler (or any other type of
scheduler). In particular, I think the I/O loop should not assume the
event handlers are implemented using coroutines -- but if someone
wants to write an awesome coroutine scheduler, they should be able to
delegate all their I/O waiting needs to the I/O loop with very little
To me, this means that the I/O loop probably should use "plain"
callback functions (i.e., not Futures, Deferreds or coroutines). We
should also standardize the interface to the I/O loop so that 3rd
parties can plug in their own I/O loop -- I don't see an end to the
debate whether the best C library for event handling is libevent,
libev or libuv.
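As a rough sketch of what a callback-only loop interface might look like, here is a tiny loop built on the `selectors` module. `CallbackLoop` and its method names are hypothetical; this is not a proposed standard interface, only an illustration that nothing in it needs Futures or coroutines.

```python
import selectors
import socket


class CallbackLoop:
    """Hypothetical minimal I/O loop that only knows about plain callbacks."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def add_reader(self, sock, callback):
        # The callback is stored as the selector key's data payload.
        self._selector.register(sock, selectors.EVENT_READ, callback)

    def remove_reader(self, sock):
        self._selector.unregister(sock)

    def run_once(self, timeout=None):
        # Wait for readiness, then call each registered callback.
        for key, _events in self._selector.select(timeout):
            key.data(key.fileobj)

    def close(self):
        self._selector.close()
```

A coroutine scheduler could sit on top of this by registering a callback that resumes the waiting coroutine, while a third-party loop would only have to provide the same handful of methods.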
While the focus of the I/O loop should be on single-threaded event
handling, some standard interface should exist so that you can run
certain code in a separate thread and wait for its completion -- I've
found this handy when calling socket.getaddrinfo(), which may block.
(Apparently async DNS lookups are really hard -- I read some
complaints about libevent's DNS lookups, and IIUC many Firefox
lockups are due to this.) But there may be other uses for this too.
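One way to picture the "run it in a thread and wait for completion" interface, using `concurrent.futures`. The helper name `resolve_in_thread` and its callback signature are assumptions made for this sketch.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_in_thread(executor, host, port, callback):
    """Run the blocking socket.getaddrinfo() call in a worker thread.

    `callback(infos, error)` is invoked when the lookup finishes; exactly
    one of the two arguments is None.
    """
    def _done(future):
        error = future.exception()
        infos = None if error is not None else future.result()
        callback(infos, error)

    future = executor.submit(socket.getaddrinfo, host, port)
    future.add_done_callback(_done)
    return future
```

Note that the callback normally runs in the worker thread (or in the calling thread if the lookup has already finished), so a real I/O loop would still need a thread-safe way to hand the result back to the loop thread.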
An issue in the design of the I/O loop is the tension between a
ready-based and completion-based design. The typical Unix design
(whether based on select or any of the poll variants) is usually
ready-based; but on Windows, the only way to get high performance is
to base it on IOCP, which is completion-based (i.e. you start a
specific async operation, like writing N bytes, and the I/O loop tells
you when it is done). I would like people to be able to write fast
event handling programs on Windows too, and ideally the only change
would be the implementation of the I/O loop. But I don't know how
tenable that is given the dramatically different style used by IOCP
and the need to use native Windows API for all async I/O -- it sounds
like we could only do this if the library providing the I/O loop
implementation also wrapped all I/O operations, and that may be a bit
Finally, there should also be some minimal interface so that multiple
I/O loops can interact -- at least in the case where one I/O loop
belongs to a GUI library. It seems this is a solved problem (as well
solved as you can hope for) to Twisted, so we should just adopt their
--Guido van Rossum (python.org/~guido)
I'd like to propose adding the ability for context managers to catch and
handle control passing into and out of them via yield and generator.send()
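class cd(object):
    def __init__(self, path):
        self.inner_path = path
    def __enter__(self):
        self.outer_path = os.getcwd()
        os.chdir(self.inner_path)
    def __exit__(self, exc_type, exc_val, exc_tb):
        os.chdir(self.outer_path)
    def __yield__(self):
        self.inner_path = os.getcwd()
        os.chdir(self.outer_path)
    def __send__(self):
        self.outer_path = os.getcwd()
        os.chdir(self.inner_path)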
Here __yield__() would be called when control is yielded through the with
block and __send__() would be called when control is returned via .send()
or .next(). To maintain compatibility, it would not be an error to leave
either __yield__ or __send__ undefined.
The rationale for this is that it's sometimes useful for a context manager
to set global or thread-global state as in the example above, but when the
code is used in a generator, the author of the generator needs to make
assumptions about what the calling code is doing. e.g.
Even if the author of this generator knows what effect do_something() and
do_something_else() have on the current working directory, the author needs
to assume that the caller of the generator isn't touching the working
directory. For instance, if someone were to create two my_generator()
generators with different paths and advance them alternately, the resulting
behaviour could be most unexpected. With the proposed change, the context
manager would be able to handle this so that the author of the generator
doesn't need to make these assumptions.
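The interleaving problem can be reproduced with today's Python. The sketch below uses an ordinary directory-changing context manager (no __yield__/__send__ hooks, since those do not exist) and two generators advanced alternately. The names `my_generator` and `interleave` and the use of temporary directories are assumptions for the demonstration.

```python
import os
import tempfile


class cd:
    """Ordinary context manager that changes the working directory."""

    def __init__(self, path):
        self.inner_path = path

    def __enter__(self):
        self.outer_path = os.getcwd()
        os.chdir(self.inner_path)

    def __exit__(self, exc_type, exc_val, exc_tb):
        os.chdir(self.outer_path)


def my_generator(path):
    with cd(path):
        yield os.getcwd()   # directory seen on the first step
        yield os.getcwd()   # directory seen after being resumed


def interleave():
    """Advance two generators alternately and record what each one sees."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as dir_a, \
            tempfile.TemporaryDirectory() as dir_b:
        path_a, path_b = os.path.realpath(dir_a), os.path.realpath(dir_b)
        first, second = my_generator(path_a), my_generator(path_b)
        try:
            seen = [next(first), next(second), next(first)]
        finally:
            first.close()
            second.close()
            os.chdir(start)  # leave the temp dirs before they are deleted
    return path_a, path_b, seen
```

When the first generator is resumed, it finds itself in the second generator's directory, which is exactly the kind of surprise the proposed hooks are meant to prevent.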
Naturally, nested with blocks would be handled by calling __yield__ from
innermost to outermost and __send__ from outermost to innermost.
I rather suspect that if this change were included, someone could come up
with a variant of the contextlib.contextmanager decorator to simplify
writing generators for this sort of situation.
J. D. Bartlett
A weekend or two ago, I was planning on doing some work on some
ideas I had regarding IOCP and the tulip/async-IO discussion.
I ended up getting distracted by WSAPoll. WSAPoll is a method
that Microsoft introduced with Vista/2008 that is intended to
be semantically equivalent to poll() on UNIX.
I decided to play around and see what it would take to get it
available via select.poll() on Windows, eventually hacking it
into a working state.
So, it basically works. poll() on Windows, who would have thought.
It's almost impossible to test with our current infrastructure; all
our unit tests seem to pass pipes and other non-Winsock-backed-socks
to poll(), which, like select()-on-Windows, isn't supported.
I suspect Twisted's test suite would give it a much better work out
(CC'd Glyph just so it's on his radar). I ended up having to verify
it worked with some admittedly-hacky dual-python-console sessions,
one poll()'ing as a server, the other connecting as a client. It
definitely works, so, it's worth keeping it in mind for the future.
It's still equivalent to poll()'s O(N) on UNIX, but that's better
than the 64/512 limit select is currently stuck with on Windows.
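For readers who have not used it, this is the shape of the select.poll() interface in question, as it works on platforms that already provide it. The helper name `wait_readable` is made up for this sketch, and the code assumes a platform where `select.poll` exists.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return the poll() events for `sock`, waiting up to `timeout_ms`."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() returns a list of (fd, event_mask) pairs, empty on timeout.
    return poller.poll(timeout_ms)
```

Unlike select(), the registration set is not bounded by a fixed-size fd_set, which is where the 64/512 limit mentioned above comes from.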
Didn't have much luck trying to get the patched Python working with
tulip's PollReactor, unfortunately, so I just wanted to provide some
feedback on that experience.
First bit of feedback: man, debugging `yield from` stuff is *hard*.
Tulip just flat out didn't work with the PollReactor from the start
but it was dying in a non-obvious way.
So, I attached both a Pdb debugger and Visual Studio debugger and
tried to step through everything to figure out why the first call
to poll() was blowing up (can't remember the exact error message
but it was along the lines of "you can't poll() whatever it is you
just asked me to poll(), it's defo' not a socket").
I eventually, almost by pure luck, traced the problem to the fact
that PollReactor's __init__ eventually results in code being called
that calls poll() on two os.pipe() objects (in EventLoop I think).
However, when I was looking at the code, it appeared as though the
first poll() came from the getaddrinfo(). So all my breakpoints
and whatnot were geared towards that, yet none of them were being
hit, yet poll() was still being called somehow, somewhere.
I ended up having to spend ages traipsing through every line in
Visual Studio's debugger to try figure out what the heck was going
on. I believe the `yield from` aspect made that so much more of an
arduous affair -- one moment I'm in selectmodule.c's getaddrinfo(),
then I'm suddenly deep in the bowels of some cryptic eval frame
black magic, then one 'step' later, I'm over in some completely
different part of selectmodule.c, and so on.
I think the reason I found it so tough was because when you're
single stepping through each line of a C program, you can sort of
always rely on the fact you know what's going to happen when you
"step" the next line.
In this case though, a step of an eval frame would wildly jump
to seemingly unrelated parts of C code. As far as I could tell,
there was no easy/obvious way to figure the details out before
stepping that instruction either (i.e. probing the various locals
So, that's the main feedback from that weekend, I guess. Granted,
it's more of a commentary on `yield from` than tulip per se, but I
figured it would be worth offering up my experience nevertheless.
I ended up with the following patch to avoid the initial poll()
against os.pipe() objects:
--- a/polling.py Sat Nov 03 13:54:14 2012 -0700
+++ b/polling.py Tue Nov 27 07:05:10 2012 -0500
@@ -41,6 +41,7 @@
@@ -459,6 +460,10 @@
     def __init__(self, eventloop, executor=None):
+        if sys.platform == 'win32':
+            # Work around the fact that we can't poll pipes on Windows.
+            if isinstance(eventloop.pollster, PollPollster):
+                eventloop = EventLoop(SelectPollster())
         self.eventloop = eventloop
         self.executor = executor # Will be constructed lazily.
         self.pipe_read_fd, self.pipe_write_fd = os.pipe()
By that stage it was pretty late in the day and I accepted defeat.
My patch didn't really work, it just allowed the test to run to
completion without the poll OSError exception being raised.
[ It's tough coming up with unique subjects for these async
discussions. I've dropped python-dev and cc'd python-ideas
instead as the stuff below follows on from the recent msgs. ]
Provide an async interface that is implicitly asynchronous;
all calls return immediately, callbacks are used to handle
How the asynchronicity (not a word, I know) is achieved is
an implementation detail, and will differ for each platform.
(Windows will be able to leverage all its async APIs to full
extent, Linux et al can keep mimicking asynchronicity via
the usual non-blocking + multiplexing (poll/kqueue etc),
thread pools, etc.)
On Wed, Nov 28, 2012 at 11:15:07AM -0800, Glyph wrote:
> On Nov 28, 2012, at 12:04 PM, Guido van Rossum <guido(a)python.org> wrote:
> I would also like to bring up <https://github.com/lvh/async-pep> again.
So, I spent yesterday working on the IOCP/async stuff. Then I saw this
PEP and the sample async/abstract.py. That got me thinking: why don't
we have a low-level async facade/API? Something where all calls are
On systems with extensive support for asynchronous 'stuff', primarily
Windows and AIX/Solaris to a lesser extent, we'd be able to leverage
the platform-provided async facilities to full effect.
On other platforms, we'd fake it, just like we do now, with select,
poll/epoll, kqueue and non-blocking sockets.
Consider the following:
__slots__ = [
def getaddrinfo(host, port, ..., cb):
def getaddrinfo_then_connect(.., callbacks=(cb1, cb2))
def accept(sock, cb):
def accept_then_write(sock, buf, (cb1, cb2)):
def accept_then_expect_line(sock, line, (cb1, cb2)):
def accept_then_expect_multiline_regex(sock, regex, cb):
def read_until(fd_or_sock, bytes, cb):
def read_all(fd_or_sock, cb):
return self.read_until(fd_or_sock, EOF, cb)
def read_until_lineglob(fd_or_sock, cb):
def read_until_regex(fd_or_sock, cb):
def read_chunk(fd_or_sock, chunk_size, cb):
def write(fd_or_sock, buf, cb):
def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
def submit_work(callable, cb):
"""Run the event loop once."""
"""Keep running the event loop until exit."""
All methods always take at least one callback. Chained methods can
take multiple callbacks (i.e. accept_then_expect_line()). You fill
in the success, failure (both callables) and timeout (an int) slots.
The engine will populate cb.cancel with a callable that you can call
at any time to (try and) cancel the IO operation. (How quickly that
works depends on the underlying implementation.)
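A rough sketch of how the callback object and one engine method could fit together, using a thread pool as the stand-in "implementation detail". `ToyEngine` and its internals are assumptions for illustration only, and the timeout slot is stored but not enforced here.

```python
from concurrent.futures import ThreadPoolExecutor


class Callback:
    """Holds the slots described above; `cancel` is filled in by the engine."""
    __slots__ = ("success", "failure", "timeout", "cancel")

    def __init__(self, success, failure, timeout=None):
        self.success = success
        self.failure = failure
        self.timeout = timeout   # kept for completeness, not enforced here
        self.cancel = None


class ToyEngine:
    """Hypothetical engine whose async mechanism happens to be a thread pool."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit_work(self, func, cb):
        future = self._pool.submit(func)
        cb.cancel = future.cancel          # best-effort cancellation

        def _done(fut):
            if fut.cancelled():
                return
            error = fut.exception()
            if error is not None:
                cb.failure(error)
            else:
                cb.success(fut.result())

        future.add_done_callback(_done)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

The caller only ever sees `submit_work()` and the callback slots; swapping the thread pool for IOCP or anything else would not change the calling code.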
I like this approach for two reasons: a) it allows platforms with
great async support to work at their full potential, and b) it
doesn't leak implementation details like non-blocking sockets, fds,
multiplexing (poll/kqueue/select, IOCP, etc). Those are all details
that are taken care of by the underlying implementation.
getaddrinfo is a good example here. Guido, in tulip, you have this
def getaddrinfo(host, port, af=0, socktype=0, proto=0):
infos = yield from scheduling.call_in_thread(
host, port, af,
That's very implementation specific. It assumes the only way to
perform an async getaddrinfo is by calling it from a separate
thread. On Windows, there's native support for async getaddrinfo(),
which we wouldn't be able to leverage here.
The biggest benefit is that no assumption is made as to how the
asynchronicity is achieved. Note that I didn't mention IOCP or
kqueue or epoll once. Those are all implementation details that
the writer of an asynchronous Python app doesn't need to care about.
I was hoping to have some code to go along with this idea, but the
WSAPoll stuff distracted me, so, I don't have any concrete examples
As part of (slowly) catching up on all the async IO discussions, I
reviewed both Twisted's current iocpreactor implementation, as well
as Richard Oudkerk's IOCP/tulip work:
Both implementations caught me a little off-guard. The Twisted
iocpreactor appears to drive a 'one shot' iteration of
GetQueuedCompletionStatus per event loop iteration -- sort of treating
it just like select/poll (in that you call it once to get a list of
things to do, do them, then call it again).
Richard's work sort of works the same way... the proactor drives
the completion port polling via GetQueuedCompletionStatus.
From what I know about overlapped IO and IOCP, this seems like a
really odd way to do things: by *not* using a thread pool, whose
threads' process_some_io_that_just_completed() methods are
automatically called* by the underlying OS, you're getting all of the
additional complexity of overlapped IO and IOCP without any of the
[*]: On AIX, Solaris and Windows XP/2003, you'd manually spawn
a bunch of threads and have them call
GetQueuedCompletionStatus or port_get(). Those methods return
when IO is available on the given completion port.
On Windows 7/2008R2, you can leverage the new thread pool
APIs. You don't need to create a single thread yourself,
just spin up a pool and associate the IO completion with
it -- Windows will automatically manage the underlying
Windows 8/2012 introduces a new API, Registered I/O,
which leverages pre-registered buffers and completion
ports to minimize overhead of copying data from kernel
to user space. It's pretty nifty. You use RIO in concert
with thread pools and IOCP.
Here's the "idea" I had, with zero working code to back it up:
what if we had a bunch of threads in the background whose sole
purpose it was to handle AIO? On Windows/AIX, they would poll
GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to
do with Python's threading.Thread() stuff. They exist solely
in C and can't be interfaced to directly from Python code.
....which means they're free to run outside the GIL, and thus,
multiple cores could be leveraged concurrently. (Only for
processing completed I/O, but hey, it's better than nothing.)
The threads would process the completion port events via C code
and allocate necessary char * buffers on demand. Upon completion
of their processing, let's say, reading 4096 bytes from a socket,
they push their processed event and data to an interlocked* list,
then go back to GetQueuedCompletionStatus/get_event.
You eventually need to process these events from Python code.
Here's where I think this approach is neat: we could expose a new
API that's semantically pretty close to how poll() works now.
Except instead of polling a bunch of non-blocking file descriptors
to see what's ready for reading/writing, you're simply getting a
list of events that completed in the background.
You process the events, then presumably, your event loop starts
again: grab available events, process them. The key here is that
nothing blocks, but not because of non-blocking sockets.
Nothing blocks because the reads() have already taken place and
the writes() return immediately. So your event loop now becomes
this tight little chunk of code that spins as quickly as it can
on processing events. (This would lend itself very well to the
Twisted notion of deferring blocking actions (like db stuff) via
callFromThread(), but I'll ignore that detail for now.)
So, presuming your Python code looks something like this:
for event in aio.events():
if event.type == EventType.DataReceived:
elif event.type == ...
Let's talk about how aio.events() would work. I mentioned that the
background threads, once they've completed processing their event,
push the event details and data onto an interlocked list.
I got this idea from the interlocked list methods available in
Windows since XP. They facilitate synchronized access to a singly
linked list without the need for explicit mutexes. Some sample
More info: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85)...
So, the last thing a background thread does before going back to
poll GetQueuedCompletionStatus/get_event is an interlocked push
onto a global list. What would it push? Depends on the event,
at the least, an event identifier, at the most, an event identifier
and pointer to the char * buffer allocated by the thread, perhaps?
Now, when aio.events() is called, we have some C code that does
an interlocked flush -- this basically pops all the entries off
the list in an interlocked, atomic fashion.
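The push/flush pattern can be mimicked in pure Python to get a feel for it. This sketch uses collections.deque, whose append() and popleft() are thread-safe in CPython, as a stand-in for the interlocked list. `CompletionList`, `worker` and `run_workers` are hypothetical names, and unlike a real interlocked flush the drain here happens item by item rather than as one atomic detach.

```python
import threading
from collections import deque


class CompletionList:
    """Stand-in for the interlocked list shared with background threads."""

    def __init__(self):
        self._items = deque()

    def push(self, event):
        # Called by background threads when an I/O operation completes.
        self._items.append(event)

    def flush(self):
        # Called by the event loop: drain everything queued so far.
        drained = []
        while True:
            try:
                drained.append(self._items.popleft())
            except IndexError:
                return drained


def worker(completions, worker_id, count):
    """Pretend to complete `count` I/O operations."""
    for i in range(count):
        completions.push((worker_id, i))


def run_workers(completions, workers=4, count=100):
    threads = [
        threading.Thread(target=worker, args=(completions, n, count))
        for n in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

An aio.events()-style call would then amount to `completions.flush()` followed by turning each raw entry into the appropriate Python object.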
It then loops over all the events and creates the necessary CPython
objects that can then be used in the subsequent Python code. So,
for data received, it would call PyBytes_FromStringAndSize(...) with
the buffer indicated in the event, then free() that chunk of memory.
(That was just the first idea that came to my mind; there are
probably tons of better ways to do it in practice. The point
is that however it's done, the end result is a GC-tracked object
with no memory leaks from the background thread buffer alloc.
Perhaps there could be a separate interlocked list of shared
buffers that the background threads pop from when they need
a buffer, and the Python aio.events() code pushes to when
it is done converting the buffer into a CPython object.
....and then down the track, subsequent optimizations that
allow the CPython object to inherit the buffer for its
lifetime, removing the need to constantly copy data from
background thread buffers to CPython buffers.)
And... I think that's the crux of it really. Key points are actual
asynchronous IO, carried out by threads that aren't GIL constrained
and thus, able to run concurrently -- coupled with a pretty simple
Now, follow-on ideas from that core premise: the longer we can stay
in C land, the more performant the solution. Putting that little
tidbit aside, I want to mention Twisted again, because I think their
protocol approach is spot on:
def dataReceived(self, data):
def lineReceived(self, line):
That's a completely nonsense example, as you wouldn't have both a
lineReceived and dataReceived method, but it illustrates the point
of writing classes that are driven by events.
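A more realistic arrangement is for dataReceived to do the buffering and call lineReceived once per complete line. The class below is a standalone sketch in that style; it does not import Twisted, and `LineProtocol` and its `lines` attribute are made up for the example.

```python
class LineProtocol:
    """Event-driven class: raw chunks come in, complete lines come out."""

    delimiter = b"\r\n"

    def __init__(self):
        self._buffer = b""
        self.lines = []

    def dataReceived(self, data):
        # Called by the event loop with whatever bytes happened to arrive.
        self._buffer += data
        while self.delimiter in self._buffer:
            line, self._buffer = self._buffer.split(self.delimiter, 1)
            self.lineReceived(line)

    def lineReceived(self, line):
        # Called once per complete line; subclasses would override this.
        self.lines.append(line)
```

The class never reads from a socket itself; something else (a reactor, or a loop over aio.events()) is responsible for calling dataReceived.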
As for maximizing how long we can stay in C versus Python, consider
serving an HTTP request. You accept a connection, wait for headers,
then send headers+data back. From a Python programmer's perspective
you don't really care if data has been read unless you've received
the entire, well-formed set of headers from the client.
For an SMTP server, it's more chatty, read a line, send a line back,
read a few more lines, send some more lines back.
In both of these cases, you could apply some post-processing of data
in C, perhaps as a simple regex match to start with. It would be
pretty easy to regex match an incoming HELO/GET line, queue the well
formed ones for processing by aio.events(), and automatically send
pre-registered errors back for those that don't match.
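The filtering step itself is simple enough to sketch in Python, even though the idea is to run it in C. The patterns, the canned error replies and the `prefilter` function are all illustrative assumptions, not part of any existing API.

```python
import re

# Hypothetical pre-registered filters: (pattern for a well-formed first
# line, canned reply to send back when the line does not match).
FILTERS = {
    "http": (
        re.compile(rb"^(GET|HEAD|POST) \S+ HTTP/1\.[01]\r\n"),
        b"HTTP/1.0 400 Bad Request\r\n\r\n",
    ),
    "smtp": (
        re.compile(rb"^(HELO|EHLO) \S+\r\n"),
        b"500 Syntax error\r\n",
    ),
}


def prefilter(protocol, first_line, queue):
    """Queue well-formed lines; return the canned error for the rest."""
    pattern, error_reply = FILTERS[protocol]
    if pattern.match(first_line):
        queue.append(first_line)
        return None
    return error_reply
```

Only lines that pass the filter would ever reach Python-level event processing; everything else is answered without waking the event loop.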
Things like accept filters on BSD work like this; i.e. don't return
back to the calling code until there's a legitimate event. It
greatly simplifies the eventual Python implementation, too. Rather
than write your own aio.events()-based event loop, you'd take the
Twisted approach and register your protocol handlers with a global
"reactor" that is responsible for processing the raw aio.events()
and then invoking the relevant method on your class instances.
So, let's assume that's all implemented and working in 3.4. The
drawback of this approach is that even though we've allowed for
some actual threaded concurrency via background IO threads, the
main Python code that loops over aio.events() is still limited
to executing on a single core. Albeit, in a very tight loop that
never blocks and would probably be able to process an insane number
of events per second when pegging a single core at 100%.
So, that's 3.4. Perhaps in 3.5 we could add automatic support for
multiprocessing once the number of events per-poll reach a certain
threshold. The event loop automatically spreads out the processing
of events via multiprocessing, facilitating multiple core usage both
via background threads *and* Python code. (And we could probably do
some optimizations such that the background IO thread always queues
up events for the same multiprocessing instance -- which would yield
even more benefits if we had fancy "buffer inheritance" stuff that
removes the need to continually copy data from the background IO
buffers to the foreground CPython code.)
As an added bonus, by the time 3.5 rolls around, perhaps the Linux
and FreeBSD camps have seen how performant IOCP/Solaris-events can
be and added similar support (the Solaris event API wouldn't be that
hard to port elsewhere for an experienced kernel hacker. It's quite
elegant, and, hey, the source code is available). (We could mimic
it in the mean time with background threads that call epoll/kqueue,
Thoughts? Example code or GTFO? ;-)
In the recent thread I started called "Speed up os.walk()..." I
was encouraged to create a module to flesh out the idea, so I present
you with BetterWalk:
It's basically all there, and works on Windows, Linux, and Mac OS X.
It probably works on FreeBSD too, but I haven't tested that. I also
haven't written thorough unit tests yet, but intend to after some
In terms of the API for iterdir_stat(), I settled on the more explicit
"pass in what stat fields you want" (the 'fields' parameter). I also
added a 'pattern' parameter to allow you to make use of the wildcard
matching that FindFirst/FindNext provide (it's useful for globbing on
POSIX too, but not a performance improvement).
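To illustrate the general shape of "iterate a directory and get stat information along with each name", here is a sketch built on os.scandir(), which was added to the standard library later (Python 3.5). This is a stand-in, not BetterWalk's implementation or its exact signature, and the function name `iter_entries_with_stat` is made up. The pattern is matched in Python here rather than by the OS.

```python
import fnmatch
import os


def iter_entries_with_stat(path, pattern="*"):
    """Yield (name, stat_result) for entries in `path` matching `pattern`."""
    with os.scandir(path) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                # DirEntry.stat() can reuse data the OS returned while
                # listing the directory, where the platform provides it.
                yield entry.name, entry.stat(follow_symlinks=False)
```

A walk function built on top of this can decide which entries are directories from the returned stat data instead of issuing a separate os.stat() call per entry.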
As for benchmarks, it's about what I saw earlier on Windows (2-6x on
recent versions, depending). My initial tests on Mac OS X show it's
5-10x as fast on that platform! I haven't double-checked those results
The results on Linux were somewhat disappointing -- only a 10% speed
improvement on large directories, and it's actually slower on small
directories. It's still doing half the number of system calls ... so I
believe this is because cached os.stat() is super fast on Linux, and
so the slowdown from using ctypes / pure Python is outweighing the
gain from not doing the system call. That said, I've also only tested
Linux in a VirtualBox setup, so maybe that's affecting it too.
Still, if it's a significant win for Windows and OS X users, it's a good thing.
In any case, I'd love it if folks could run the benchmark on their
system (with and without -s) and comment further on the idea and API.
On 28.11.2012 16:49, Richard Oudkerk wrote:
> You are assuming that GetQueuedCompletionStatus*() will never block
> because of lack of work.
GetQueuedCompletionStatusEx takes a time-out argument, it can be zero.
According to Apple engineers:
For API outside of POSIX, including GCD and technologies like
Accelerate, we do not support usage on both sides of a fork(). For
this reason among others, use of fork() without exec is discouraged in
general in processes that use layers above POSIX.
Multiprocessing on OSX calls os.fork, but not os.exec.
Thus, is multiprocessing erroneously implemented on Mac? Forking
without calling exec means that only APIs inside POSIX can be used by
the child process.
For NumPy, it even affects functions like matrix multiplication when the
accelerate framework is used for BLAS.
Does multiprocessing need a reimplementation on Mac to behave as it
does on Windows? (Yes it would cripple it similarly to the crippled
multiprocessing on Windows.)
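For context, later Python versions (3.4 and up) let the caller choose how child processes are started, which gives the Windows-like behaviour on any platform without touching os.fork. The sketch below uses the "spawn" start method; the `square` and `squares_with_spawn` names are made up for the example.

```python
import multiprocessing


def square(x):
    # Must live at module top level so that spawned children can import it.
    return x * x


def squares_with_spawn(values):
    """Run square() in worker processes started fresh rather than forked."""
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


# Usage, kept under the usual guard because spawned children re-import
# the main module:
#
# if __name__ == "__main__":
#     print(squares_with_spawn([1, 2, 3]))
```

The cost is the same as on Windows: arguments and functions have to be picklable, and nothing is inherited from the parent's memory.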
And what about Python itself? Is there any non-POSIX code in the
interpreter? If there is, os.fork should be removed on Mac.