[Python-Dev] Post-PyCon updates to PyParallel

Trent Nelson trent at snakebite.org
Thu Mar 28 07:26:51 CET 2013


    [ python-dev: I've set up a new list for pyparallel discussions:
      https://lists.snakebite.net/mailman/listinfo/pyparallel.  This
      e-mail will be the last I'll send to python-dev@ regarding the
      on-going pyparallel work; please drop python-dev@ from the CC
      and just send to pyparallel at lists.snakebite.net -- I'll stay on
      top of the posts-from-unsubscribed-users moderation for those that
      want to reply to this e-mail but not subscribe. ]

Hi folks,

    Wanted to give a quick update on the parallel work both during and
    after PyCon.  When I presented the slides (the ones I uploaded to
    speakerdeck.com) at the language summit, the majority of questions
    from other developers revolved around the big issues: data
    integrity, and what happens when parallel objects interact with
    main-thread objects and vice versa.

    So, during the sprints, I explored putting guards in place to raise
    an exception if we detect that a user has assigned a parallel object
    to a non-protected main-thread object.

        (I describe the concept of 'protection' in my follow-up posts to
         python-dev last week: http://mail.python.org/pipermail/python-dev/2013-March/124690.html.
         
         Basically, protecting a main-thread object allows code like
         this to work without crashing:
            d = async.dict()
            def foo():
                # async.rdtsc() is a helper method
                # that basically wraps the result of
                # the assembly RDTSC (read time-
                # stamp counter) instruction into a
                # PyLong object.  So, it's handy when
                # I need to test the very functionality
                # being demonstrated here (creating
                # an object within a parallel context
                # and persisting it elsewhere).
                d['foo'] = async.rdtsc()

            def bar():
                d['bar'] = async.rdtsc()

            async.submit_work(foo)
            async.submit_work(bar)
         )

    It was actually pretty easy, far easier than I expected.  It was
    achieved via Px_CHECK_PROTECTION():

        https://bitbucket.org/tpn/pyparallel/commits/f3fe082668c6f3f699db990f046291ff66b1b467#LInclude/object.hT1072

    Various new tests related to the protection functionality:

        https://bitbucket.org/tpn/pyparallel/commits/f3fe082668c6f3f699db990f046291ff66b1b467#LLib/async/test/test_primitives.pyT58

    The type of changes I had to make to other parts of CPython to
    perform the protection checks:

        https://bitbucket.org/tpn/pyparallel/commits/f3fe082668c6f3f699db990f046291ff66b1b467#LObjects/abstract.cT170
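
    (To give a rough idea of what those checks amount to -- the helper
     name, the exception type and the message below are made up for this
     sketch, and while Py_PXCTX and Px_CHECK_PROTECTION() are real names,
     their exact shapes may well differ from what's shown:

        /* Simplified sketch, not the code from the commits above:
         * refuse a write into a main-thread object from a parallel
         * thread unless the target has been protected. */
        static int
        _Px_EnsureWritable(PyObject *target)
        {
            if (Py_PXCTX && !Px_CHECK_PROTECTION(target)) {
                PyErr_SetString(PyExc_RuntimeError,
                                "parallel thread attempted to modify an "
                                "unprotected main-thread object");
                return -1;
            }
            return 0;
        }
     )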

    That was all working fine... until I started looking at adding
    support for lists (i.e. appending a parallel thread object to a
    protected, main-thread list).

    The problem is that appending to a list will often involve a list
    resize, which is done via PyMem_REALLOC() and some custom fiddling.
    That would mean if a parallel thread attempts to append to a list
    and it needs resizing, all the newly realloc'd memory would be
    allocated from the parallel context's heap.  Now, this heap would
    stick around as long as the parallel objects have a refcount > 0.

    However, as soon as the last parallel object's refcount hits 0, the
    entire context will be scheduled for the cleanup/release/free dance,
    which will eventually blow away the entire heap and all the memory
    allocated against that heap... which means all the **ob_item stuff
    that was reallocated as part of the list resize.

    Not particularly desirable :-)  As I was playing around with ways to
    potentially pre-allocate lists, it occurred to me that dicts would
    be affected in the exact same way; I just hadn't run into it yet
    because my unit tests only ever assigned a few (<5) objects to the
    protected dicts.

    Once the threshold gets reached (10?), a "dict resize" would take
    place, which would involve lots of PyMem_REALLOCs, and we'd get into
    the exact same situation mentioned above.
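
    (To make the hazard concrete: anything allocated against a context's
     heap dies with that heap.  Here's a tiny standalone illustration of
     the failure mode, using raw Win32 heap calls to stand in for a
     parallel context's heap -- illustration only, not pyparallel code:

        #include <windows.h>

        int
        main(void)
        {
            /* Stand-in for a parallel context's private heap. */
            HANDLE ctx_heap = HeapCreate(0, 0, 0);

            /* Stand-in for a main-thread list's ob_item array that got
             * realloc'd from within the parallel context. */
            int *ob_item = (int *)HeapAlloc(ctx_heap, 0, 16 * sizeof(int));
            if (ob_item)
                ob_item[0] = 42;

            /* Context cleanup: the whole heap goes away in one call... */
            HeapDestroy(ctx_heap);

            /* ...and any later access to ob_item from the main thread is
             * now a use-after-free. */
            return 0;
        }

     In both the list and dict cases, the main-thread object ends up
     pointing at memory it doesn't own.)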

    So, at that point, I concluded that the whole async protection stuff
    was not a viable long-term solution.  (In fact, the reason I first
    added it was simply to have an easy way to test things in unit
    tests.)

    The new solution I came up with: new thread-safe, interlocked data
    types that are *specifically* designed for this exact use case:
    transferring results from computation in a parallel thread back to
    a main-thread 'container' object.

    First up is a new list type: xlist() (PyXListObject/PyXList_Type).
    I've just committed the work-in-progress stuff I've been able to
    hack out whilst traveling the past few days:

        https://bitbucket.org/tpn/pyparallel/commits/5b662eba4efe83e94d31bd9db4520a779aea612a

    It's not finished, and I'm pretty sure it doesn't even compile yet,
    but the idea is something like this:

        results = xlist()

        def worker1(input):
            # do work
            result = useful_work1()
            results.push(result)

        def worker2(input):
            # do work
            result = useful_work2()
            results.push(result)

        data = data_to_process()
        async.submit_work(worker1, data[:len(data)//2])
        async.submit_work(worker2, data[len(data)//2:])
        async.run()

        for result in results:
            print(result)

    The big change is what happens during xlist.push():

        https://bitbucket.org/tpn/pyparallel/commits/5b662eba4efe83e94d31bd9db4520a779aea612a#LPython/pyparallel.cT3844

+PyObject *
+xlist_push(PyObject *obj, PyObject *src)
+{
+    PyXListObject *xlist = (PyXListObject *)obj;
+    assert(src);
+
+    if (!Py_PXCTX)
+        PxList_PushObject(xlist->head, src);
+    else {
+        PyObject *dst;
+        _PyParallel_SetHeapOverride(xlist->heap_handle);
+        dst = PyObject_Clone(src, "objects of type %s cannot "
+                                  "be pushed to xlists");
+        _PyParallel_RemoveHeapOverride();
+        if (!dst)
+            return NULL;
+        PxList_PushObject(xlist->head, dst);
+    }
+
+    /*
+    if (Px_CV_WAITERS(xlist))
+        ConditionVariableWakeOne(&(xlist->cv));
+    */
+
+    Py_RETURN_NONE;
 }
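
    (For orientation, the fields touched above -- head, heap_handle and
     cv -- suggest the xlist object itself is shaped roughly like this.
     This is a guess at the layout for illustration, not the actual
     declaration from the commit:

        typedef struct _PyXListObject {
            PyObject_HEAD
            PxListHead         *head;        /* interlocked push/pop list */
            HANDLE              heap_handle; /* per-xlist heap, see below */
            CONDITION_VARIABLE  cv;          /* waiters; not wired up yet */
        } PyXListObject;

     In the main thread (!Py_PXCTX), the pushed object goes straight onto
     the interlocked list; from a parallel context, it's first cloned
     into the xlist's own heap via the heap override.)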

    Note the heap override and PyObject_Clone(), which currently looks
    like this:

+PyObject *
+PyObject_Clone(PyObject *src, const char *errmsg)
+{
+    int valid_type;
+    PyObject *dst;
+    PyTypeObject *tp;
+
+    tp = Py_TYPE(src);
+
+    valid_type = (
+        PyBytes_CheckExact(src)         ||
+        PyByteArray_CheckExact(src)     ||
+        PyUnicode_CheckExact(src)       ||
+        PyLong_CheckExact(src)          ||
+        PyFloat_CheckExact(src)
+    );
+
+    if (!valid_type) {
+        PyErr_Format(PyExc_ValueError, errmsg, tp->tp_name);
+        return NULL;
+    }
+
+    if (PyLong_CheckExact(src)) {
+
+    } else if (PyFloat_CheckExact(src)) {
+
+    } else if (PyUnicode_CheckExact(src)) {
+
+    } else {
+        assert(0);
+    }
+
+
+}

    Initially, I just want to get support working for simple types that
    are easy to clone.  Any sort of GC/container types will obviously
    take a lot more work as they need to be deep-copied.
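
    For those simple types, one way the empty branches above might
    eventually be filled in is sketched below.  This is just an
    illustration using public C API calls, not code from the repository,
    and the PyLong case is simplified to values that fit in a long long:

        #include <Python.h>

        /* Hypothetical helper: clone a "value-like" object so that the
         * copy is allocated from whichever heap is currently in effect
         * (i.e. the xlist's heap, thanks to the override in
         * xlist_push()). */
        static PyObject *
        clone_simple(PyObject *src)
        {
            if (PyBytes_CheckExact(src))
                return PyBytes_FromStringAndSize(PyBytes_AS_STRING(src),
                                                 PyBytes_GET_SIZE(src));

            if (PyByteArray_CheckExact(src))
                return PyByteArray_FromStringAndSize(
                           PyByteArray_AS_STRING(src),
                           PyByteArray_GET_SIZE(src));

            if (PyUnicode_CheckExact(src)) {
                if (PyUnicode_READY(src) < 0)
                    return NULL;
                return PyUnicode_FromKindAndData(PyUnicode_KIND(src),
                                                 PyUnicode_DATA(src),
                                                 PyUnicode_GET_LENGTH(src));
            }

            if (PyFloat_CheckExact(src))
                return PyFloat_FromDouble(PyFloat_AS_DOUBLE(src));

            if (PyLong_CheckExact(src)) {
                /* Simplification: arbitrary-precision values need more
                 * care than this. */
                long long v = PyLong_AsLongLong(src);
                if (v == -1 && PyErr_Occurred())
                    return NULL;
                return PyLong_FromLongLong(v);
            }

            PyErr_SetString(PyExc_ValueError,
                            "objects of this type cannot be cloned");
            return NULL;
        }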

    You might also note the Px_CV_WAITERS() bit; these interlocked lists
    could quite easily function as producer/consumer queues, so maybe
    you could do something like this:

        queue = xlist()

        def consumer(input):
            # do work
            ...

        def producer():
            for i in range(100):
                queue.push(i)

        async.submit_queue(queue, consumer)
        async.submit_work(producer)

    Oh, forgot to mention the heap-override specifics: each xlist() gets
    its own heap handle -- when the "pushing" is done and the parallel
    object needs to be copied, the new memory is allocated against the
    xlist's heap.  That heap will stick around until the xlist's refcnt
    hits 0, then everything will be blown away in one fell swoop.

    (Which means I'll need to tweak the memory/refcnt intercepts to
     handle this new concept -- like I had to do to support the notion
     of persisted contexts.  Not a big deal.)
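
    (And teardown itself should be simple: once the xlist's refcnt hits
     0, the deallocator just has to destroy the heap.  A sketch of that
     idea, assuming the Win32 heap APIs and the hypothetical
     PyXListObject layout from earlier, and ignoring anything the main
     thread pushed directly -- again, not actual code from the
     repository:

        static void
        xlist_dealloc(PyXListObject *xlist)
        {
            /* Every object cloned via xlist_push() lives in this heap,
             * so destroying it releases them all in one fell swoop. */
            if (xlist->heap_handle)
                HeapDestroy(xlist->heap_handle);
            PyObject_Del(xlist);
        }
     )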

    I really like this approach, much more so than the persisted context
    stuff and the even-more-convoluted promotion stuff (yet to be
    written).  Encapsulating all the memory associated with
    parallel-to-main-thread object transitions in the very object that
    is used to effect the transition just feels right.

    So, that means there are three main "memory alloc override"-type
    modes currently:

        - Normal.  (Main-thread stuff, ref counting, PyMalloc stuff.)
        - Purely parallel.  (Context-specific heap stuff, very fast.)
        - Parallel->main-thread transitions.  (The stuff above.)

    (Or rather, there will be, once I finish this xlist stuff.  That'll
     allow me to deprecate the TLS override stuff and the Context
     persistence stuff, both of which were nice experiments but fizzled
     out in practice.)

    ...the Px_CHECK_PROTECTION() work was definitely useful, though, and
    will need to be extended to all objects.  That will allow us to
    raise an exception if someone attempts to assign a parallel object
    to a normal main-thread object (instead of to one of the approved
    interlocked/parallel objects, like xlist).


    Regards,

        Trent.

