[Python-ideas] [Python-Dev] PyParallel: alternate async I/O and GIL removal

Trent Nelson trent at snakebite.org
Sun Nov 17 23:34:32 CET 2013


    (I saw that there were a number of additional e-mails echoing
     Guido's sentiment/concerns re: shared nothing.  I picked this
     thread to reply to and tried to provide as much info as possible
     in lieu of replying to everyone individually.)

On Sat, Nov 16, 2013 at 06:56:11PM -0800, Guido van Rossum wrote:
>    I wish you had spent more time on explaining how IOCP works and less on
>    judging other approaches.

    Heh, it's funny: in previous presentations I didn't labor the
    point anywhere near as much, and I found that UNIX people were
    very defensive of the status quo.

    I probably over-compensated in the opposite direction this time.
    I don't think anyone is going to argue vehemently that the UNIX
    status quo is optimal on Windows, but a side effect is that it
    unfairly disparages existing bodies of work (Twisted et al.) that
    have undeniably improved the overall ecosystem over the past
    decade.

>    Summarizing my understanding of what you're saying, it seems the "right"
>    way to use IOCP on a multi-core machine is to have one thread per core
>    (barring threads you need for unavoidably blocking stuff) and to let the
>    kernel schedule callbacks on all those threads. As long as the callbacks
>    don't block and events come in at a rate to keep all those cores busy this
>    will be optimal.

    The only thing I'd add is that, when speaking in terms of socket
    servers and whatnot, it helps to visualize Python callbacks as "the
    bits of logic that need to run before invoking the next asynchronous
    call".

    Anything I/O related can be done via an asynchronous call; that's
    basically the exit point of the processing thread -- it dispatches
    the async WSARecv() (for example), then moves on to the next
    request in the I/O completion port's queue.

    When that WSARecv() returns, we get all the info we need from the
    completion context to figure out what we just did, and, based on the
    protocol we provided, what needs to be done next.

    So, we do a little more pure Python processing and then dispatch the
    next asynchronous call, which, in this case, might be a WSASend();
    the thread then moves on to the next request in the queue.
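
    For a sense of what that looks like from Python, here's a minimal
    echo-style sketch in the spirit of the PyParallel examples (treat
    the exact async.* names and signatures as approximate; the
    authoritative versions live in the sandbox branch):

        import async

        class Echo:
            # Runs on whichever parallel thread picked up the completed
            # WSARecv(); whatever it returns becomes the payload of the
            # next async send, and the thread then moves on to the next
            # completion in the port's queue.
            def data_received(self, transport, data):
                return data

        server = async.server('0.0.0.0', 8080)
        async.register(transport=server, protocol=Echo)
        async.run()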

    That's all handled by the PxSocket_IOLoop monstrosity:

        http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6246

    I got the inspiration for that implementation from ceval.c's
    PyEval_EvalFrameEx; you basically have one big inline function
    where you can go from anything to anything without having to call
    additional C functions; thus, doing back-to-back sends, for
    example, won't exhaust your stack.

    That allows us to do the dynamic switch between sync and async
    depending on protocol preference, current client load, number of
    active I/O hogs, that sort of thing:

        http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6467
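
    Purely as an illustration of the kind of heuristic involved (the
    attribute names and thresholds below are invented; the real logic
    is in the C linked above):

        # Illustrative only; names and thresholds are made up.
        def prefer_synchronous(ctx):
            if ctx.protocol_prefers_async:
                return False
            if ctx.active_io_hogs >= ctx.num_cpus:
                # Too many long-running operations in flight; stay async
                # so this thread gets back to the completion port quickly.
                return False
            return ctx.active_clients < ctx.sync_send_threshold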

    PxSocket_IOLoop currently only handles 1:1 TCP/IP connections, which
    limits its applicability.  I want to expand that -- I should be able
    to connect any sort of endpoints together in any fashion -- similar
    to how ZeroMQ allows the bridge/fan-out/router type composition.

    An endpoint would be anything that allows me to initiate an async
    operation against it, e.g. file, device, socket, whatever.  This
    is where Windows really shines, because you can literally do
    everything either synchronously or asynchronously.  There should
    also be support for 1:m and m:n relationships between endpoints
    (i.e. an IRC chat server).

    So I see PxSocket_IOLoop turning into a more generic PxThread_Loop
    that can handle anything-to-anything -- basically, calling the
    Python code that needs to run before dispatching the next async
    call.

    The current implementation also does a lot of live introspection
    against the protocol object to figure out what to do next; e.g.
    the first entry point for a newly-connected client is here:

        http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6326

    At every entry point into the loop, and at every point *after* the
    relevant Python code has been run, we're relying on the protocol to
    tell us what to do next in a very hard-coded fashion.

    I think for PxThread_Loop to become truly dynamic, it should mirror
    PyEval_EvalFrameEx even more closely; the protocol analysis should
    be done separately, the output of which is a stream of async-opcode
    bytes that direct the main dispatching logic:

        http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Python/pyparallel.c#l6305

    dispatch:
        switch (next_opcode) {
            TARGET(maybe_shutdown_send_or_recv);
            TARGET(handle_error);
            TARGET(connection_made_callback);
            TARGET(data_received_callback);
            TARGET(send_complete_callback);
            TARGET(overlapped_recv_callback);
            TARGET(post_callback_that_supports_sending_retval);
            TARGET(post_callback_that_does_not_support_sending_retval);
            TARGET(close_);
            TARGET(try_send);

            default:
                break;
        }

    Then we'd have one big case statement, just like PyEval_EvalFrameEx,
    that handles all possible async-opcodes, rather than the goto
    spaghetti in the current PxSocket_IOLoop.

    The async opcodes would be generic and platform-independent; e.g.
    file write, file read, single-socket write, multi-socket write, etc.

    On Windows/Solaris/AIX, everything could be handled asynchronously;
    on other platforms, you would have to fake it using an event loop
    plus a multiplexing method, much as twisted/tornado/tulip do
    currently.
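
    As a purely illustrative sketch (none of this exists yet, and the
    opcode and attribute names are invented), the protocol-analysis
    step might boil a protocol down to something like this, with the
    loop then simply walking the opcode stream:

        # "Compile" a protocol into a stream of generic, platform-
        # independent async-opcodes up front, instead of introspecting
        # the protocol object on every trip around the loop.
        (OVERLAPPED_RECV,
         DATA_RECEIVED_CALLBACK,
         TRY_SEND,
         CLOSE_) = range(4)

        def compile_protocol(protocol):
            opcodes = [OVERLAPPED_RECV, DATA_RECEIVED_CALLBACK, TRY_SEND]
            if getattr(protocol, 'close_after_send', False):
                opcodes.append(CLOSE_)
            return opcodes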

>    But this is almost tautological. It only works if the threads don't
>    communicate with each other or with the main thread (all shared data must
>    be read-only). But heh, if that's all, one process per core works just as
>    well. :-)

    Ok, so, heh, I lied in the presentation.  The main thread won't be
    frozen per se, and the parallel threads will have a way to share
    state.  I've already done a huge amount of work on this, but it's
    very involved and that presentation was long enough as it is.

    Also, it's easier to understand why reference counting and GC isn't
    needed in parallel contexts if you just assume the main thread isn't
    running.

    In reality, one of the first things I had to figure out was how
    these parallel contexts could communicate state back to the main
    thread -- because without this ability, how the heck would you
    propagate an exception raised in a parallel thread back to the main
    thread?  The exception will be backed by memory allocated in the
    parallel context -- that can't be freed until the exception has
    been dealt with and no references to it remain.

    As that was one of the first problems I had to solve, it has one of
    the hackiest implementations :-)  The main thread's async.run_once()
    implementation can detect which parallel threads raised exceptions
    (because they've done an interlocked push to the main thread's error
    list) and it will extend the lifetime of the context for an
    additional number of subsequent runs of run_once().  Once the TTL of
    the context drops to 0, it is finally released.
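
    In pseudo-Python, the lifetime-extension hack looks roughly like
    this (the real logic is in C; the names here are invented purely
    for illustration):

        ERROR_CONTEXT_TTL = 4                # extra run_once() iterations

        def run_once():
            for exc in interlocked_flush(error_list):
                # The exception is backed by its parallel context's heap,
                # so keep that context alive for a few more runs.
                exc.context.ttl = ERROR_CONTEXT_TTL
            for ctx in list(pending_contexts):
                ctx.ttl -= 1
                if ctx.ttl <= 0:
                    release(ctx)             # heap (and exception) gone now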

    The reason it's hacky is because there's no direct correlation
    between the exception object finally having no references to it and
    the point we destroy the context.  If you persisted the exception
    object to a list in the main thread somewhere, you'd segfault down
    the track when trying to access that memory.

    So, on the second iteration, I came up with some new concepts:
    context persistence and object promotion.  A main-thread list or
    dict could be async-protected such that this would work:

        # main thread
        d1 = {}
        l1 = []
        async.protect(d1)
        async.protect(l1)
        # this would also work
        d2 = async.dict()
        l2 = async.list()

        # fyi: async.rdtsc() returns a PyLong wrapped
        # version of the CPU TSC; handy for generating
        # non-interned objects allocated from a parallel
        # context

        def callback(name):
            d1[name] = async.rdtsc()
            l1.append(async.rdtsc())

        async.submit_work(callback, 'foo')
        async.submit_work(callback, 'bar')
        async.submit_work(callback, 'moo')
        async.run()

    That actually works; the async.protect() call intercepts the
    object's tp_as_mapping and tp_as_sequence fields and redirects them
    to thread-safe versions that use read/write locks.  It also toggles
    a persistence bit on both the parallel context and the parallel long
    object, such that reference counting *is* actually enabled on the
    object once it's back in the main thread.  When its ref count drops
    to 0, we check whether it's an object that's been persisted; if so,
    we decref the original context, and only when the context's refcount
    reaches 0 do we free it.

        (I also did some stuff where you could promote simple objects
        where it made sense -- i.e. there's no need to keep a 4k context
        around if the end result was a scalar that could be represented
        in 50-200 bytes; just memcpy it from the main thread ("promote
        it to a main thread object with reference counting") and free
        the context.)
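
    Expressed as pseudo-code, the persisted-object path described above
    looks something like this (illustrative only; the real work happens
    in the C dealloc/free paths, and these names are invented):

        def dealloc(obj):
            if obj.px_persisted:            # persistence bit set via async.protect()
                ctx = obj.px_context        # parallel context owning its memory
                ctx.refcount -= 1
                if ctx.refcount == 0:
                    free_context_heap(ctx)  # only now is the context released
            else:
                free(obj)                   # ordinary main-thread object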

    You can see some examples of the different type of stuff you can do
    here:

        http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_primitives.py

    The problem, though, was that none of my unit tests assigned more
    than ten items to a list/dict, so I never encountered a resize :-)

    You can imagine what happens when a resize takes place within a
    parallel context -- the list/dict is realloc'd using the parallel
    context's heap allocator.  That's not ideal: it's a main-thread
    object, and it shouldn't be reallocated with temporary parallel-
    thread memory.

    I think that was the point where I went "oh, bollocks!" and switched
    over to tackling the async socket stuff.

    However, the async socket work forced me to implement all sorts of
    new concepts, including the heap snapshots and TLS heap overrides
    (for interned strings).

    Pair that with the page locking stuff and I have a much richer set
    of tools at my disposal to solve that problem -- I just need to
    completely overhaul everything memory related now that I know how it
    needs to be implemented :-)

    As for the dict/list assignment/resize, the key problem is figuring
    out whether a PyObject_Realloc call is taking place because we're
    resizing a main-thread container object.  That's not an easy thing
    to determine -- all you have at the point you need to make the
    decision is a pointer.

    That's where the memory refactoring work comes in -- I'm still
    working on the details, but the general idea is that you'll be able
    to do very efficient pointer address tests against known base
    address masks to figure out the origin of the object and how the
    current memory request needs to be satisfied.
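
    Roughly, the idea is that if every parallel-context heap is carved
    out of a reserved, aligned region, "where did this pointer come
    from?" becomes a mask-and-compare.  A sketch (the base address and
    mask below are made up):

        PX_HEAP_BASE = 0x0000700000000000
        PX_HEAP_MASK = 0xFFFFF00000000000

        def is_parallel_memory(addr):
            return (addr & PX_HEAP_MASK) == PX_HEAP_BASE

        # A realloc request whose pointer fails this test is for a
        # main-thread object and must not be satisfied from the
        # (temporary) parallel heap.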

    The other option I played around with was an interlocked list type
    that is exposed directly to Python:

        x = async.xlist()

        def callback():
            x.push(async.rdtsc())

        for _ in range(10):
            async.submit_work(callback)

        async.run()
        # interlocked flush of all results into a list.
        results = x.flush()

    The key difference between an interlocked list and a normal list is
    that an interlocked list has its very own localized heap, just like
    parallel contexts have; pushing a scalar onto the list automatically
    "promotes" it.  That is, the object is memcpy'd directly using the
    xlist's heap, and we can keep that heap alive independently of the
    parallel contexts that pushed objects onto it.

    I was also planning on using this as a waitable queue, so you could
    compose pipelines of producers/consumers and that sort of thing.
    Then I ran out of time :-)

>    I don't really care how well CHARGEN (I had to look it up) scales. For
>    HTTP, it's great for serving static contents from a cache or from the
>    filesystem, but if that's all you serve, why use Python? Real web apps use
>    intricate combinations of databases, memcache, in-memory cache, and
>    template expansion. The biggest difference you can make there is probably
>    getting rid of the ORM in favor of more direct SQL, and next on the list
>    would be reimplementing template expansion in C. (And heck, you could
>    release the GIL while you're doing that. :-)

    Agree with the general sentiment "if that's all you're doing, why
    use Python?".  The async HTTP server should allow other things to be
    built on top of it such that it adds value over and above, say, an
    Apache instance serving static files.

>          And in hindsight, perhaps I need to put more emphasis on the fact
>          that it *is* very experimental work with a long-term view, versus
>          Tulip/asyncio, which was intended for *now*.  So although Tulip and
>          PyParallel spawned from the same discussions and are attempting to
>          attack the same problem -- it's really not fair for me to discredit
>          Tulip/Twisted in favor of PyParallel because they're on completely
>          different playing fields with vastly different implementation time
>          frames (I'm thinking 5+ years before this work lands in a mainstream
>          Python release -- if it ever does.  And if not, hey, it can live on
>          as another interpreter, just like Stackless et al).
> 
>    I would love it if you could write a list of things a callback *cannot* do
>    when it is in parallel mode. I believe that list includes mutating any
>    kind of global/shared state (any object created in the main thread is
>    read-only in parallel mode -- it seems you had to work hard to make string
>    interning work, which is semantically transparent but mutates hidden
>    global state). In addition (or, more likely, as a consequence!) a callback
>    cannot create anything that lasts beyond the callback's lifetime, except
>    for the brief time between the callback's return and the completion of the
>    I/O operation involving the return value. (Actually, I missed how you do
>    this -- doesn't this mean you cannot release the callback's heap until
>    much later?)

    So, I think I already answered that above.  The next presentation
    (PyCon Montreal) will be purely focused on this stuff -- I've been
    beating the alternate-approach-to-async-I/O drum for long enough ;-)

>    So it seems that the price for extreme concurrency is the same as always
>    -- you can only run purely functional code. Haskell fans won't mind, but
>    for Python this seems to be putting the cart before the horse -- who wants
>    to write Python with those constraints?

    Basically it's all still work in progress, but the PyParallel-for-
    parallel-compute use case is very important.  And there's no way
    that can be done without having a way to return the results of
    parallel computation back into the next stage of your pipeline where
    more analysis is done.

    Getting hired by Continuum is actually great for this use case;
    we're in the big data, parallel task decomposition space, after all,
    not the writing-async-socket-server business ;-)  I know Peter and
    Travis are both very supportive of PyParallel, so it's just a matter
    of trying to find time to work on it between consultancy engagements.

        Trent.

