[Python-ideas] Async API: some more code to review

Tue Oct 30 16:57:39 CET 2012

Guido van Rossum wrote:
> I don't want to beat around the bush, I think your approach is too slow. In many
> situations I would be guilty of premature optimization saying this, but (a) the
> whole *point* of async I/O is to be blindingly fast (the C10K problem), and (b)
> the time difference is rather marked.
> 
> I wrote a simple program for each version (attached) that times a simple
> double-recursive function, where each recursive level uses yield.
>
> With a depth of 20, wattle takes about 24 seconds on my MacBook Pro.
> And the same problem in tulip takes 0.7 seconds! That's close to two orders of
> magnitude. Now, this demo is obviously geared towards showing the pure overhead
> of the "one future per level" approach compared to "pure yield from". But that's
> what you're proposing.

I get similar results on my machine with those benchmarks, though the difference
was not so significant with my own test (100 connections x 100 messages to
SocketSpam.py - I included SocketSpamStress.py). The only time there was more
than about 5% difference was when the 'yield from' case was behaving completely
differently (each connection's routine was not interleaving with the others - my
own bug, which I fixed).

Choice of scheduler makes a difference as well. Using my
UnthreadedSocketScheduler() instead of SingleThreadedScheduler() halves the time
taken, and just using "main(depth).result()" reduces that by about 10% again. It
still is not directly comparable to tulip, but there are ways to make them
equivalent (discussed below).

> And I think allowing the user to mix yield and yield from is just too risky.

The errors involved when you get yield and yield from confused are quite clear
in this case. However, if you use 'yield' instead of 'yield from' in tulip, you
simply don't ever run that function. Maybe this will give you an error further
down the track, but it won't be as immediate.
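The silent failure described above can be reproduced with plain generators. This
is only a minimal sketch: run(), child() and the two callers are hypothetical
names, and run() is a trivial stand-in for a scheduler that ignores whatever is
yielded to it. It is not tulip's actual event loop.

```python
def child(log):
    log.append("child ran")
    yield            # hand control back to the driver once
    return 42

def uses_yield_from(log):
    value = yield from child(log)   # delegates: child's body executes
    return value

def uses_yield(log):
    value = yield child(log)        # bug: yields the generator object itself
    return value

def run(gen):
    """Trivial driver (assumption): resume with None until the generator ends."""
    try:
        while True:
            next(gen)
    except StopIteration as stop:
        return stop.value

good_log, bad_log = [], []
good = run(uses_yield_from(good_log))
bad = run(uses_yield(bad_log))
print(good, good_log)
print(bad, bad_log)
```

With 'yield', the child generator is created but never started, so its body
never executes and the caller just gets None back with no error at that point.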

On the other hand, if you're really after extreme performance (*cough*use
C*cough* :) ) we can easily add an "__unwrapped__" attribute to @async that
provides access to the internal generator, which you can then 'yield from' from:

@async
def binary(n):
    if n <= 0:
        return 1
    l = yield from binary.__unwrapped__(n-1)
    r = yield from binary.__unwrapped__(n-1)
    return l + 1 + r

With this change the performance is within 5% of tulip (most times are up to 5%
slower, but some are faster - I'd say margin of error), regardless of the
scheduler. (I've no doubt this could be improved further by modifying _Awaiter
and Future to reduce the amount of memory allocations, and a super optimized
library could use C implementations that still fit the API and work with
existing code.)
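To make the "__unwrapped__" idea concrete, here is a rough sketch of how such a
decorator could be structured. Everything in it is an assumption made for
illustration: async_ (spelled with an underscore because async is now a
keyword), Task and run() are hypothetical stand-ins, and the real @async,
Future and scheduler in wattle are more involved than this.

```python
class Task:
    """Hypothetical per-call wrapper, standing in for a future."""
    def __init__(self, gen):
        self.gen = gen

def async_(fn):
    """Wrap each call in a Task, but keep the raw generator function reachable."""
    def wrapper(*args, **kwargs):
        return Task(fn(*args, **kwargs))
    wrapper.__unwrapped__ = fn
    return wrapper

def run(task):
    """Toy driver: a yielded Task is run to completion and its result sent back."""
    result = None
    try:
        while True:
            waited = task.gen.send(result)
            result = run(waited) if isinstance(waited, Task) else None
    except StopIteration as stop:
        return stop.value

@async_
def slow(n):
    if n <= 0:
        return 1
    l = yield slow(n - 1)       # one Task object per level
    r = yield slow(n - 1)
    return l + 1 + r

@async_
def fast(n):
    if n <= 0:
        return 1
    l = yield from fast.__unwrapped__(n - 1)   # no wrapper object per level
    r = yield from fast.__unwrapped__(n - 1)
    return l + 1 + r

slow_result = run(slow(3))
fast_result = run(fast(3))
print(slow_result, fast_result)
```

Both versions compute the same value; the difference is only in how many
wrapper objects and driver round trips each level of recursion costs.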

I much prefer treating 'yield from __unwrapped__' as an advanced case, so I'm
all for providing ways to optimize async code where necessary, but when I think
about how I'd teach this to a class of undergraduates I'd much rather have the
simpler @async/yield rule (which doesn't even require an understanding of
generators). For me, "get it to work" and "get it to work, fast" come well
before "get it to work fast".

> (I got rid of block_r/w() + bare yield as a public API from tulip -- that API is
> now wrapped up in a generator too. And I can do that without feeling guilty
> knowing that an extra level of generators costs me almost nothing.)

I don't feel particularly guilty about the extra level... if the operations
you're blocking on are that much quicker than the overhead then you probably
don't need to block. I'm pretty certain that even with multiple network cards
you'll still suffer from bus contention before suffering from generator
overhead.

> Debugging experience: I made the same mistake in each program (I guess I copied
> it over before fixing the bug :-), which caused an AttributeError to happen at
> the time.time() call. In both frameworks this was baffling, because it caused
> the program to exit immediately without any output. So on this count we're even.
> :-)

This is my traceback once I misspell time():

...>c:\Python33_x64\python.exe wattle_bench.py
Traceback (most recent call last):
  File "wattle_bench.py", line 27, in <module>
    SingleThreadScheduler().run(main, depth=depth)
  File "SingleThreadScheduler.py", line 106, in run
    raise self._exit_exception
  File "scheduler.py", line 171, in _step
    next_future = self.generator.send(result)
  File "wattle_bench.py", line 22, in main
    t1 = time.tme()
AttributeError: 'module' object has no attribute 'tme'

Of course, if you do call an @async function and don't yield (or call result())
then you won't ever see an exception. I don't think there's any nice way to
propagate these automatically (except maybe through a finalizer... not so keen
on that). You can do 'op.add_done_callback(Future.result)' to force the error to
be raised somewhere (or better yet, pass it to a logger - this is why we allow
multiple callbacks, after all).
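As a sketch of the logging variant, the following uses the standard library's
concurrent.futures.Future as a stand-in, on the assumption that wattle's Future
exposes a similar add_done_callback()/exception() interface. The log_failure
callback, the errors list and the logger name are hypothetical.

```python
import logging
from concurrent.futures import Future

errors = []   # kept only so the example can be inspected afterwards

def log_failure(fut):
    """Done-callback: report the exception of an operation nobody waited on."""
    exc = fut.exception()       # returns immediately, the future is done
    if exc is not None:
        errors.append(exc)
        logging.getLogger("async-demo").error("operation failed: %r", exc)

op = Future()
op.add_done_callback(log_failure)

# Simulate the operation failing without anyone calling op.result().
op.set_exception(AttributeError("'module' object has no attribute 'tme'"))
```

Note that concurrent.futures itself logs and ignores exceptions raised inside a
callback, so passing Future.result directly would also end up in a log there
rather than propagating; other Future implementations may behave differently.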

> I have to think more about what I'd like to borrow from wattle -- I agree that
> it's nice to mark up async functions with a decorator (it just shouldn't affect
> call speed), I like being able to start a task with a single call. 

You'll probably find (as I did in my early work) that starting the task in the
initial call doesn't work with yield from. Because it does the first next()
call, you can't send results/exceptions back in. If all the yields (at the
deepest level) are blank, this might be okay, but it caused me issues when I was
yielding objects to wait for.
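The problem can be shown with plain generators. In this minimal sketch (child,
parent and the string values are hypothetical), the eager case mimics a call
that has already performed the first next() before the caller reaches
'yield from':

```python
def child():
    reply = yield "wait-for-A"   # object the scheduler should wait on
    return reply

def parent(gen):
    result = yield from gen
    return result

# Lazy start: the driver sees the yielded object and can send a result back.
g = parent(child())
seen = next(g)
try:
    g.send("A-result")
except StopIteration as stop:
    lazy_value = stop.value

# Eager start: the first next() has already happened inside the call.
c = child()
consumed = next(c)               # "wait-for-A" goes to the starter, not the driver
g = parent(c)
try:
    next(g)                      # 'yield from' resumes child with None
    eager_value = "still running"
except StopIteration as stop:
    eager_value = stop.value

print(seen, lazy_value)
print(consumed, eager_value)
```

In the eager case the parent's driver never sees the object to wait for, and
the child is resumed with None instead of the real result.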

I'm also interested in your thoughts on get_future_for(), since that seems to be
one of the more unorthodox ideas of wattle. I can clearly see how it works, but
I have no idea whether I've expressed it well in the description.

Cheers,
Steve