PyParallel update (was: solving multi-core Python)

On Sat, Jun 20, 2015 at 03:42:33PM -0600, Eric Snow wrote:
So, I've been sprinting relentlessly on PyParallel since Christmas, and recently reached my v0.0 milestone of being able to handle all the TEFB tests, plus get the "instantaneous wiki search" thing working too.

The TEFB (Techempower Framework Benchmarks) implementation is here:

    https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

(The aim was to have it compete in this: https://www.techempower.com/benchmarks/#section=data-r10, but unfortunately they broke their Windows support after round 9, so there's no way to get PyParallel into the official results without fixing that first.)

The wiki thing is here:

    https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

I particularly like the wiki example as it leverages a lot of benefits afforded by PyParallel's approach to parallelism, concurrency and asynchronous I/O:

    - Load a digital search trie (datrie.Trie) that contains every Wikipedia title and the byte-offset within the wiki.xml where the title was found. (Once loaded the RSS of python.exe is about 11GB; the trie itself has about 16 million items in it.)

    - Load a numpy array of sorted 64-bit integer offsets. This allows us to do a searchsorted() (binary search) against a given offset in order to derive the next offset.

    - Once we have a way of getting two byte offsets, we can use ranged HTTP requests (and TransmitFile behind the scenes) to efficiently read random chunks of the file asynchronously. (Windows has a huge advantage here -- there's simply no way to achieve similar functionality on POSIX in a non-blocking fashion (sendfile can block, a disk read() can block, a memory reference into a mmap'd file that isn't in memory will page fault, which will block).)

The performance has far surpassed anything I could have imagined back during the async I/O discussions in September 2012, so, time to stick a fork in it and document the experience, which is what I'll be working on in the coming weeks.
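The offset lookup described in the wiki example can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PyParallel repository: a plain dict stands in for the datrie.Trie, a sorted list with the standard library's bisect_right stands in for the numpy array and searchsorted(), and the names byte_range_for and range_header are hypothetical. The HTTP Range header uses inclusive byte positions, so the end of one article is the next article's offset minus one.

```python
from bisect import bisect_right

def byte_range_for(offsets, start):
    """Return (start, end) inclusive byte positions for the chunk beginning at `start`.

    `offsets` must be sorted ascending. bisect_right is a stand-in for a
    binary search such as numpy's searchsorted(); it finds the first offset
    strictly greater than `start`, i.e. where the next article begins.
    """
    i = bisect_right(offsets, start)
    if i == len(offsets):
        # Last article in the file: no next offset, so read to end of file.
        return start, None
    return start, offsets[i] - 1

def range_header(start, end):
    """Format the value of an HTTP Range header for the given byte range."""
    if end is None:
        return "bytes=%d-" % start
    return "bytes=%d-%d" % (start, end)

# Toy data (made up for illustration): title -> byte offset, plus the
# sorted list of all offsets.
titles = {"Alpha": 0, "Beta": 120, "Gamma": 450, "Delta": 900}
offsets = sorted(titles.values())

start, end = byte_range_for(offsets, titles["Beta"])
print(range_header(start, end))
```

The point of keeping a separate sorted offsets array is that the title lookup only yields where an article starts; the binary search supplies where it ends, which is what a ranged request needs.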
In the meantime:

    - There are installers available here for those that wish to play around with the current state of things: http://download.pyparallel.org/

    - I wrote a little helper thing that diffs the hg tree against the original v3.3.5 tag I based the work off and committed the diffs directly -- this provides a way to review the changes that were made in order to get to the current level of functionality:

          https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

      (It only includes files that existed in the v3.3.5 tag, I don't include diffs for new files I've added.)

It's probably useful reviewing the diffs after perusing pyparallel.h:

    https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33...

....as you'll see lots of guards in place in most of the diffs. E.g.:

    Py_GUARD()          -- make sure we never hit this from a parallel context
    Px_GUARD()          -- make sure we never hit this from a main thread
    Py_GUARD_OBJ(o)     -- make sure object o is always a main thread object
    Px_GUARD_OBJ(o)     -- make sure object o is always a parallel object
    PyPx_GUARD_OBJ(o)   -- if we're a parallel context, make sure it's a parallel object, if we're a main thread, make sure it's a main thread object.

If you haven't heard of PyParallel before, this might be a good place to start: https://speakerdeck.com/trent/

The core concepts haven't really changed since here (re: parallel contexts, main thread, main thread objects, parallel thread objects):

    https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploite...

Basically, if we're a main thread, "do what we normally do", if we're a parallel thread, "divert to a thread-safe alternative".

And a final note: I like the recent async additions. I mean, it's unfortunate that the new keyword clashes with the module name I used to hide all the PyParallel trickery, but I'm at the point now where calling something like this from within a parallel context is exactly what I need:

    async f.write(...)
    async cursor.execute(...)

I've been working on PyParallel on-and-off now for ~2.5 years and have learned a lot and churned out a lot of code -- documenting it all is actually somewhat daunting (where do I start?!), so, if anyone has specific questions about how I addressed certain things, I'm more than happy to provide more detail on specifics.

    Trent.
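For readers who want a feel for the guard idea in the message above, here is a loose Python analogy. It is an assumption-laden sketch, not PyParallel code: the real guards are C macros in pyparallel.h, whereas everything below (the thread-local flag, parallel_context, py_guard, px_guard, append_item) is a hypothetical stand-in that only mimics the rule "main thread does what it normally does, parallel context diverts to a thread-safe alternative".

```python
import threading
from contextlib import contextmanager

# Hypothetical marker for "we are inside a parallel context". PyParallel
# tracks this in C; a thread-local flag is only an illustrative stand-in.
_ctx = threading.local()

def in_parallel_context():
    return getattr(_ctx, "parallel", False)

@contextmanager
def parallel_context():
    """Mark the current thread as a parallel context for the duration."""
    previous = in_parallel_context()
    _ctx.parallel = True
    try:
        yield
    finally:
        _ctx.parallel = previous

def py_guard():
    """Analogy of Py_GUARD(): fail if reached from a parallel context."""
    if in_parallel_context():
        raise RuntimeError("main-thread-only code reached from a parallel context")

def px_guard():
    """Analogy of Px_GUARD(): fail if reached from the main thread."""
    if not in_parallel_context():
        raise RuntimeError("parallel-only code reached from the main thread")

def append_item(items, value, lock):
    """Main thread: do the normal thing. Parallel context: take the safe path."""
    if in_parallel_context():
        with lock:  # thread-safe alternative
            items.append(value)
    else:
        items.append(value)  # normal path

items, lock = [], threading.Lock()
append_item(items, "from main", lock)
with parallel_context():
    px_guard()  # passes: we are in a parallel context
    append_item(items, "from parallel", lock)
py_guard()  # passes: back on the normal path
```

The guards do no work of their own; they exist to make a wrong-context call fail loudly at the point where it happens rather than corrupt state later.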

On Tue, Jun 23, 2015 at 09:53:01AM -0400, Trent Nelson wrote:
Oops, I was off by about 12 million:

    C:\PyParallel33>python.exe
    PyParallel 3.3.5 (3.3-px:829ae345012e+, Jun 15 2015, 16:54:16) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.chdir('examples\\wiki')
    >>> import wiki as w
    About to load titles trie, this will take a while...
    >>> len(w.titles)
    27962169

On Tue, Jun 23, 2015 at 7:53 AM, Trent Nelson <trent@snakebite.org> wrote:
Thanks for the update, Trent. I've skimmed through it and will be reading more in-depth when I get a chance. I'm sure I'll have more questions for you. :) -eric

Participants (2): Eric Snow, Trent Nelson