Mailman 3 Slides from today's parallel/async Python talk - Python-Dev

Slides from today's parallel/async Python talk

Trent Nelson

14 Mar 2013 14 Mar '13

7:35 a.m.

Just posted the slides for those that didn't have the benefit of attending the language summit today: https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-altern... Trent.

Show replies by date

Christian Heimes

14 Mar 14 Mar

5:51 p.m.

Am 14.03.2013 03:05, schrieb Trent Nelson:

...

Just posted the slides for those that didn't have the benefit of attending the language summit today:

https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-altern...

Wow, neat! Your idea with Py_PXCTC is ingenious. As far as I remember the FS and GS segment registers are used by most modern operating systems on x86 and x86_64 platforms nowadays to distinguish threads. TLS is implemented with FS and GS registers. I guess the __read[gf]sdword() intrinsics do exactly the same. Reading registers is super fast and should have a negligible effect on code. ARM CPUs don't have segment registers because they have a simpler addressing model. The register CP15 came up after a couple of Google searches. IMHO you should target x86, x86_64, ARMv6 and ARMv7. ARMv7 is going to be more important than x86 in the future. We are going to see more ARM based servers. Christian

Antoine Pitrou

6:08 p.m.

Le Thu, 14 Mar 2013 13:21:09 +0100, Christian Heimes a écrit :

...

IMHO you should target x86, x86_64, ARMv6 and ARMv7. ARMv7 is going to be more important than x86 in the future. We are going to see more ARM based servers.

Well we can't really see less of them, since there are hardly any ;-) Related reading: http://www.anandtech.com/show/6757/calxedas-arm-server-tested Regards Antoine.

a.cavallo＠cavallinux.eu

6:15 p.m.

By the way on the arm (and any platform that can do cross-compiling) I've created a Makefile based build of the python 2.7.x: https://bitbucket.org/cavallo71/android Please don't be fooled by the Android name, it really can take any crosscompiler (provided it follows the gcc synatx). It was born out of the frustration with trying to adapt ./configure to do cross compiling. It is a sliglty different update to the problem as tried by the kiwy project for example. I hope this helps, Antonio On 2013-03-14 13:38, Antoine Pitrou wrote:

...

Le Thu, 14 Mar 2013 13:21:09 +0100, Christian Heimes a écrit :

...
IMHO you should target x86, x86_64, ARMv6 and ARMv7. ARMv7 is going to be more important than x86 in the future. We are going to see more ARM based servers.

Well we can't really see less of them, since there are hardly any ;-)

Related reading: http://www.anandtech.com/show/6757/calxedas-arm-server-tested

Trent Nelson

11:53 p.m.

On Thu, Mar 14, 2013 at 05:21:09AM -0700, Christian Heimes wrote:

...

Am 14.03.2013 03:05, schrieb Trent Nelson:

...
Just posted the slides for those that didn't have the benefit of attending the language summit today:

https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-altern...

Wow, neat! Your idea with Py_PXCTC is ingenious.

Yeah, it's funny how the viability and performance of the whole approach comes down to a quirky little trick for quickly detecting if we're in a parallel thread ;-) I was very chuffed when it all fell into place. (And I hope the quirkiness of it doesn't detract from the overall approach.)

...

As far as I remember the FS and GS segment registers are used by most modern operating systems on x86 and x86_64 platforms nowadays to distinguish threads. TLS is implemented with FS and GS registers. I guess the __read[gf]sdword() intrinsics do exactly the same.

Yup, in fact, if I hadn't come up with the __read[gf]sword() trick, my only other option would have been TLS (or the GetCurrentThreadId /pthread_self() approach in the presentation). TLS is fantastic, and it's definitely an intrinsic part of the solution (the "Y" part of "if we're a parallel thread, do Y"), but it definitely more costly than a simple FS/GS register read.

...

Reading registers is super fast and should have a negligible effect on code.

Yeah the actual instruction is practically free; the main thing you pay for is the extra branch. However, most of the code looks like this: if (Py_PXCTX) something_small_and_inlineable(); else Py_INCREF(op); /* also small and inlineable */ In the majority of the cases, all the code for both branches is going to be in the same cache line, so a mispredicted branch is only going to result in a pipeline stall, which is better than a cache miss.

...

ARM CPUs don't have segment registers because they have a simpler addressing model. The register CP15 came up after a couple of Google searches.

Noted, thanks!

...

IMHO you should target x86, x86_64, ARMv6 and ARMv7. ARMv7 is going to be more important than x86 in the future. We are going to see more ARM based servers.

Yeah that's my general sentiment too. I'm definitely curious to see if other ISAs offer similar facilities (Sparc, IA64, POWER etc), but the hierarchy will be x86/x64 > ARM > * for the foreseeable future. Porting the Py_PXCTX part is trivial compared to the work that is going to be required to get this stuff working on POSIX where none of the sublime Windows concurrency, synchronisation and async IO primitives exist.

...

Christian

Trent.

Stefan Ring

15 Mar 15 Mar

1:29 a.m.

...

Yup, in fact, if I hadn't come up with the __read[gf]sword() trick, my only other option would have been TLS (or the GetCurrentThreadId /pthread_self() approach in the presentation). TLS is fantastic, and it's definitely an intrinsic part of the solution (the "Y" part of "if we're a parallel thread, do Y"), but it definitely more costly than a simple FS/GS register read.

I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

Trent Nelson

3:53 a.m.

On Thu, Mar 14, 2013 at 12:59:57PM -0700, Stefan Ring wrote:

...

...
Yup, in fact, if I hadn't come up with the __read[gf]sword() trick, my only other option would have been TLS (or the GetCurrentThreadId /pthread_self() approach in the presentation). TLS is fantastic, and it's definitely an intrinsic part of the solution (the "Y" part of "if we're a parallel thread, do Y"), but it definitely more costly than a simple FS/GS register read.

I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

Sure, but, uh, that's kinda' trivial in comparison to all the wildly unportable Windows-only functionality I'm using to achieve all of this at the moment :-) For the record, here are all the Windows calls I'm using that have no *direct* POSIX equivalent: Interlocked singly-linked lists: - InitializeSListHead() - InterlockedFlushSList() - QueryDepthSList() - InterlockedPushEntrySList() - InterlockedPushListSList() - InterlockedPopEntrySlist() Synchronisation and concurrency primitives: - Critical sections - InitializeCriticalSectionAndSpinCount() - EnterCriticalSection() - LeaveCriticalSection() - TryEnterCriticalSection() - Slim read/writer locks (some pthread implements have rwlocks)*: - InitializeSRWLock() - AcquireSRWLockShared() - AcquireSRWLockExclusive() - ReleaseSRWLockShared() - ReleaseSRWLockExclusive() - TryAcquireSRWLockExclusive() - TryAcquireSRWLockShared() - One-time initialization: - InitOnceBeginInitialize() - InitOnceComplete() - Generic event, signalling and wait facilities: - CreateEvent() - SetEvent() - WaitForSingleObject() - WaitForMultipleObjects() - SignalObjectAndWait() Native thread pool facilities: - TrySubmitThreadpoolCallback() - StartThreadpoolIo() - CloseThreadpoolIo() - CancelThreadpoolIo() - DisassociateCurrentThreadFromCallback() - CallbackMayRunLong() - CreateThreadpoolWait() - SetThreadpoolWait() Memory management: - HeapCreate() - HeapAlloc() - HeapDestroy() Structured Exception Handling (#ifdef Py_DEBUG): - __try/__except Sockets: - ConnectEx() - AcceptEx() - WSAEventSelect(FD_ACCEPT) - DisconnectEx(TF_REUSE_SOCKET) - Overlapped WSASend() - Overlapped WSARecv() Don't get me wrong, I grew up with UNIX and love it as much as the next guy, but you can't deny the usefulness of Windows' facilities for writing high-performance, multi-threaded IO code. It's decades ahead of POSIX. (Which is also why it bugs me when I see select() being used on Windows, or IOCP being used as if it were a poll-type "generic IO multiplexor" -- that's like having a Ferrari and speed limiting it to 5mph!) So, before any of this has a chance of working on Linux/BSD, a lot more scaffolding will need to be written to provide the things we get for free on Windows (threadpools being the biggest freebie). Trent.

Sturla Molden

21 Mar 21 Mar

6:23 p.m.

Den 14. mars 2013 kl. 23:23 skrev Trent Nelson :

...

For the record, here are all the Windows calls I'm using that have no *direct* POSIX equivalent:

Interlocked singly-linked lists: - InitializeSListHead() - InterlockedFlushSList() - QueryDepthSList() - InterlockedPushEntrySList() - InterlockedPushListSList() - InterlockedPopEntrySlist()

Synchronisation and concurrency primitives: - Critical sections - InitializeCriticalSectionAndSpinCount() - EnterCriticalSection() - LeaveCriticalSection() - TryEnterCriticalSection() - Slim read/writer locks (some pthread implements have rwlocks)*: - InitializeSRWLock() - AcquireSRWLockShared() - AcquireSRWLockExclusive() - ReleaseSRWLockShared() - ReleaseSRWLockExclusive() - TryAcquireSRWLockExclusive() - TryAcquireSRWLockShared() - One-time initialization: - InitOnceBeginInitialize() - InitOnceComplete() - Generic event, signalling and wait facilities: - CreateEvent() - SetEvent() - WaitForSingleObject() - WaitForMultipleObjects() - SignalObjectAndWait()

Native thread pool facilities: - TrySubmitThreadpoolCallback() - StartThreadpoolIo() - CloseThreadpoolIo() - CancelThreadpoolIo() - DisassociateCurrentThreadFromCallback() - CallbackMayRunLong() - CreateThreadpoolWait() - SetThreadpoolWait()

Memory management: - HeapCreate() - HeapAlloc() - HeapDestroy()

Structured Exception Handling (#ifdef Py_DEBUG): - __try/__except

Sockets: - ConnectEx() - AcceptEx() - WSAEventSelect(FD_ACCEPT) - DisconnectEx(TF_REUSE_SOCKET) - Overlapped WSASend() - Overlapped WSARecv()

Don't get me wrong, I grew up with UNIX and love it as much as the next guy, but you can't deny the usefulness of Windows' facilities for writing high-performance, multi-threaded IO code. It's decades ahead of POSIX. (Which is also why it bugs me when I see select() being used on Windows, or IOCP being used as if it were a poll-type "generic IO multiplexor" -- that's like having a Ferrari and speed limiting it to 5mph!)

So, before any of this has a chance of working on Linux/BSD, a lot more scaffolding will need to be written to provide the things we get for free on Windows (threadpools being the biggest freebie).

Have you considered using OpenMP instead of Windows API or POSIX threads directly? OpenMP gives you a thread pool and synchronization primitives for free as well, with no special code needed for Windows or POSIX. OpenBLAS (and GotoBLAS2) uses OpenMP to produce a thread pool on POSIX systems (and actually Windows API on Windows). The OpenMP portion of the C code is wrapped so it looks like sending an asynch task to a thread pool; the C code is not littered with OpenMP pragmas. If you need something like Windows threadpools on POSIX, just look at the BSD licensed OpenBLAS code. It is written to be scalable for the world's largest supercomputers (but also beautifully written and very easy to read). Cython has code to register OpenMP threads as Python threads, in case that is needed. So that problem is also solved. Sturla

Trent Nelson

22 Mar 22 Mar

8:30 a.m.

No, I haven't. I'd lose the excellent Windows pairing of thread pool IO and overlapped IO facilities if I did that. Not saying it isn't an option down the track for the generic "submit work" API though; that stuff will work against any thread pool without too much effort. But for now, the fact that all I need to call is TrySubmitThreadpoolCallback and Windows does *everything* else is pretty handy. Lets me concentrate on the problem instead of getting distracted by scaffolding. This e-mail was sent from a wireless device. On 21 Mar 2013, at 05:53, "Sturla Molden" wrote:

...

Den 14. mars 2013 kl. 23:23 skrev Trent Nelson :

...
For the record, here are all the Windows calls I'm using that have no *direct* POSIX equivalent:

Interlocked singly-linked lists: - InitializeSListHead() - InterlockedFlushSList() - QueryDepthSList() - InterlockedPushEntrySList() - InterlockedPushListSList() - InterlockedPopEntrySlist()

Synchronisation and concurrency primitives: - Critical sections - InitializeCriticalSectionAndSpinCount() - EnterCriticalSection() - LeaveCriticalSection() - TryEnterCriticalSection() - Slim read/writer locks (some pthread implements have rwlocks)*: - InitializeSRWLock() - AcquireSRWLockShared() - AcquireSRWLockExclusive() - ReleaseSRWLockShared() - ReleaseSRWLockExclusive() - TryAcquireSRWLockExclusive() - TryAcquireSRWLockShared() - One-time initialization: - InitOnceBeginInitialize() - InitOnceComplete() - Generic event, signalling and wait facilities: - CreateEvent() - SetEvent() - WaitForSingleObject() - WaitForMultipleObjects() - SignalObjectAndWait()

Native thread pool facilities: - TrySubmitThreadpoolCallback() - StartThreadpoolIo() - CloseThreadpoolIo() - CancelThreadpoolIo() - DisassociateCurrentThreadFromCallback() - CallbackMayRunLong() - CreateThreadpoolWait() - SetThreadpoolWait()

Memory management: - HeapCreate() - HeapAlloc() - HeapDestroy()

Structured Exception Handling (#ifdef Py_DEBUG): - __try/__except

Sockets: - ConnectEx() - AcceptEx() - WSAEventSelect(FD_ACCEPT) - DisconnectEx(TF_REUSE_SOCKET) - Overlapped WSASend() - Overlapped WSARecv()

Don't get me wrong, I grew up with UNIX and love it as much as the next guy, but you can't deny the usefulness of Windows' facilities for writing high-performance, multi-threaded IO code. It's decades ahead of POSIX. (Which is also why it bugs me when I see select() being used on Windows, or IOCP being used as if it were a poll-type "generic IO multiplexor" -- that's like having a Ferrari and speed limiting it to 5mph!)

So, before any of this has a chance of working on Linux/BSD, a lot more scaffolding will need to be written to provide the things we get for free on Windows (threadpools being the biggest freebie).

Have you considered using OpenMP instead of Windows API or POSIX threads directly? OpenMP gives you a thread pool and synchronization primitives for free as well, with no special code needed for Windows or POSIX.

OpenBLAS (and GotoBLAS2) uses OpenMP to produce a thread pool on POSIX systems (and actually Windows API on Windows). The OpenMP portion of the C code is wrapped so it looks like sending an asynch task to a thread pool; the C code is not littered with OpenMP pragmas. If you need something like Windows threadpools on POSIX, just look at the BSD licensed OpenBLAS code. It is written to be scalable for the world's largest supercomputers (but also beautifully written and very easy to read).

Cython has code to register OpenMP threads as Python threads, in case that is needed. So that problem is also solved.

Sturla

Antoine Pitrou

21 Mar 21 Mar

6:47 p.m.

Le Thu, 14 Mar 2013 15:23:37 -0700, Trent Nelson a écrit :

...

Don't get me wrong, I grew up with UNIX and love it as much as the next guy, but you can't deny the usefulness of Windows' facilities for writing high-performance, multi-threaded IO code. It's decades ahead of POSIX.

I suppose that's why all high-performance servers run under Windows. Regards Antoine.

Trent Nelson

22 Mar 22 Mar

8:41 a.m.

http://c2.com/cgi/wiki?BlubParadox ;-) Sent from my iPhone On 21 Mar 2013, at 06:18, "Antoine Pitrou" wrote:

...

Le Thu, 14 Mar 2013 15:23:37 -0700, Trent Nelson a écrit :

...
Don't get me wrong, I grew up with UNIX and love it as much as the next guy, but you can't deny the usefulness of Windows' facilities for writing high-performance, multi-threaded IO code. It's decades ahead of POSIX.

I suppose that's why all high-performance servers run under Windows.

Regards

Antoine.

_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/trent%40snakebite.org

Charles-François Natali

4 Apr 4 Apr

1:48 p.m.

Just a quick implementation question (didn't have time to read through all your emails :-) async.submit_work(func, args, kwds, callback=None, errback=None) How do you implement arguments passing and return value? e.g. let's say I pass a list as argument: how do you iterate on the list from the worker thread without modifying the backing objects for refcounts (IIUC you use a per-thread heap and don't do any refcounting). Same thing for return value, how do you pass it to the callback? cf

Trent Nelson

5 Apr 5 Apr

1:34 a.m.

Hi Charles-François, On Thu, Apr 04, 2013 at 01:18:58AM -0700, Charles-François Natali wrote:

...

Just a quick implementation question (didn't have time to read through all your emails :-)

async.submit_work(func, args, kwds, callback=None, errback=None)

How do you implement arguments passing and return value?

e.g. let's say I pass a list as argument: how do you iterate on the list from the worker thread without modifying the backing objects for refcounts (IIUC you use a per-thread heap and don't do any refcounting).

Correct, nothing special is done for the arguments (apart from incref'ing them in the main thread before kicking off the parallel thread (then decref'ing them in the main thread once we're sure the parallel thread has finished)).

...

Same thing for return value, how do you pass it to the callback?

For submit_work(), you can't :-) In fact, an exception is raised if the func() or callback() or errback() attempts to return a non-None value. It's worth noting that I eventually plan to have the map/reduce-type functionality (similar to what multiprocessing offers) available via a separate 'parallel' façade. This will be geared towards programs that are predominantly single-threaded, but have lots of data that can be processed in parallel at various points. Now, with that being said, there are a few options available at the moment if you want to communicate stuff from parallel threads back to the main thread. Originally, you could do something like this: d = async.dict() def foo(): d['foo'] = async.rdtsc() def bar(): d['bar'] = async.rdtsc() async.submit_work(foo) async.submit_work(bar) But I recently identified a few memory-management flaws with that approach (I'm still on the fence with this issue... initially I was going to drop all support, but I've since had ideas to address the memory issues, so, we'll see). There's also this option: d = dict() @async.call_from_main_thread_and_wait def store(k, v): d[str(k)] = str(v) def foo(): store('foo', async.rdtsc()) def bar(): store('bar', async.rdtsc()) async.submit_work(foo) async.submit_work(bar) (Not a particularly performant option though; the main-thread instantly becomes the bottleneck.) Post-PyCon, I've been working on providing new interlocked data types that are specifically designed to bridge the parallel/main- thread divide: xl = async.xlist() def foo(): xl.push(async.rdtsc()) def bar(): xl.push(async.rdtsc()) async.submit_work(foo) async.submit_work(bar) while True: x = xl.pop() if not x: break process(x) What's interesting about xlist() is that it takes ownership of the parallel objects being pushed onto it. That is, it basically clones them, using memory allocated from its own internal heap (allowing the parallel-thread's context heap to be freed, which is desirable). The push/pop operations are interlocked at the C level, which obviates the need for any explicit locking. I've put that work on hold for now though; I want to finish the async client/server stuff (it's about 60-70% done) first. Once that's done, I'll tackle the parallel.*-type façade. Trent.

Charles-François Natali

12:23 p.m.

Hello,

...

...
async.submit_work(func, args, kwds, callback=None, errback=None)

How do you implement arguments passing and return value?

e.g. let's say I pass a list as argument: how do you iterate on the list from the worker thread without modifying the backing objects for refcounts (IIUC you use a per-thread heap and don't do any refcounting).

Correct, nothing special is done for the arguments (apart from incref'ing them in the main thread before kicking off the parallel thread (then decref'ing them in the main thread once we're sure the parallel thread has finished)).

IIUC you incref the argument from the main thread before publishing it to the worker thread: but what about containers like list? How do you make sure the refcounts of the elements don't get deallocated while the worker thread iterates? More generally, how do you deal with non-local objects? BTW I don't know if you did, but you could probably have a look at Go's goroutines and Erlang processes. cf

Trent Nelson

6:42 p.m.

On Thu, Apr 04, 2013 at 11:53:01PM -0700, Charles-Fran?ois Natali wrote:

...

Hello,

...
...
async.submit_work(func, args, kwds, callback=None, errback=None)

How do you implement arguments passing and return value?

e.g. let's say I pass a list as argument: how do you iterate on the list from the worker thread without modifying the backing objects for refcounts (IIUC you use a per-thread heap and don't do any refcounting).

Correct, nothing special is done for the arguments (apart from incref'ing them in the main thread before kicking off the parallel thread (then decref'ing them in the main thread once we're sure the parallel thread has finished)).

IIUC you incref the argument from the main thread before publishing it to the worker thread: but what about containers like list? How do you make sure the refcounts of the elements don't get deallocated while the worker thread iterates?

Ah, so, all of my examples were missing async.run(). They should have looked like this: async.submit_work(foo) async.submit_work(bar) async.run() async.run() is called from the main thread, with the GIL held, and it blocks until all parallel threads (well, parallel contexts, to be exact) have completed. The parallel 'work' doesn't actually start until async.run() is called either. (That's completely untrue at the moment; async.submit_work(foo) will execute foo() in a parallel thread immediately. Fixing that is on the todo list.) With only parallel threads running, no main-thread objects could ever be deallocated*, as no decref'ing is ever done. [*]: unless you went out of your way to delete/deallocate main thread objects via the @async.call_from_main_thread facility. At the moment, that's firmly in the category of "Don't Do That". (And, thinking about it a little more, I guess I could augment the ceval loop in such a way that in order for the main thread to run things scheduled via @async.call_from_main_thread, all parallel threads need to be suspended. Or I could just freeze/thaw them (although I don't know if there are POSIX counterparts to those Windows methods). That would definitely impede performance, but it would assure data integrity. Perhaps it should be enabled by default, with the option to disable it for consenting adults.)

...

More generally, how do you deal with non-local objects?

Read-only ops against non-local (main-thread) objects from parallel threads are free, which is nice. Things get tricky when you try to mutate main-thread objects from parallel threads. That's where all the context persistence, interlocked data types, object protection etc stuff comes in. Is... that what you mean by how do I deal with non-local objects? I took a guess ;-) Regards, Trent.

"Martin v. Löwis"

15 Mar 15 Mar

4:20 a.m.

Am 14.03.13 12:59, schrieb Stefan Ring:

...

I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

That assumes that the compiler supports __thread variables, which isn't that portable in the first place. Regards, Martin

Trent Nelson

4:30 a.m.

On Thu, Mar 14, 2013 at 03:50:27PM -0700, "Martin v. Löwis" wrote:

...

Am 14.03.13 12:59, schrieb Stefan Ring:

...
I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

That assumes that the compiler supports __thread variables, which isn't that portable in the first place.

FWIW, I make extensive use of __declspec(thread). I'm aware of GCC and Clang's __thread alternative. No idea what IBM xlC, Sun Studio and others offer, if anything. Trent.

Baptiste Lepilleur

21 Mar 21 Mar

10 p.m.

2013/3/15 Trent Nelson

...

On Thu, Mar 14, 2013 at 03:50:27PM -0700, "Martin v. Löwis" wrote:

...
Am 14.03.13 12:59, schrieb Stefan Ring:

...
I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

That assumes that the compiler supports __thread variables, which isn't that portable in the first place.

FWIW, I make extensive use of __declspec(thread). I'm aware of GCC and Clang's __thread alternative. No idea what IBM xlC, Sun Studio and others offer, if anything.

IBM xlC and Sun Studio also support this feature. From memory, it's also __thread keyword. This features is also supported by the new C11/C++11 standards. Baptiste.

Trent Nelson

10:25 p.m.

That's good to hear :-) (It's a fantastic facility, I couldn't imagine having to go back to manual TLS API stuff after using __thread/__declspec(thread).) This e-mail was sent from a wireless device. On 21 Mar 2013, at 09:30, "Baptiste Lepilleur" mailto:baptiste.lepilleur@gmail.com> wrote: 2013/3/15 Trent Nelson mailto:trent@snakebite.org> On Thu, Mar 14, 2013 at 03:50:27PM -0700, "Martin v. Löwis" wrote:

...

Am 14.03.13 12:59, schrieb Stefan Ring:

...
I think you should be able to just take the address of a static __thread variable to achieve the same thing in a more portable way.

That assumes that the compiler supports __thread variables, which isn't that portable in the first place.

FWIW, I make extensive use of __declspec(thread). I'm aware of GCC and Clang's __thread alternative. No idea what IBM xlC, Sun Studio and others offer, if anything. IBM xlC and Sun Studio also support this feature. From memory, it's also __thread keyword. This features is also supported by the new C11/C++11 standards. Baptiste.

"Martin v. Löwis"

15 Mar 15 Mar

4:26 a.m.

Am 14.03.13 11:23, schrieb Trent Nelson:

...

...
ARM CPUs don't have segment registers because they have a simpler addressing model. The register CP15 came up after a couple of Google searches.

Noted, thanks!

Yeah that's my general sentiment too. I'm definitely curious to see if other ISAs offer similar facilities (Sparc, IA64, POWER etc), but the hierarchy will be x86/x64 > ARM > * for the foreseeable future.

Most (in particular the RISC ones) do have a general-purpose register reserved for TLS. For ARM, the interesting thing is that CP15 apparently is not available on all ARM implementations, and Linux then emulates it on processors that don't have it (by handling the trap), which is costly. Additionally, it appears that Android fails to provide that emulation (in some versions, on some processors), so that seems to be tricky ground.

...

Porting the Py_PXCTX part is trivial compared to the work that is going to be required to get this stuff working on POSIX where none of the sublime Windows concurrency, synchronisation and async IO primitives exist.

I couldn't understand from your presentation why this is essential to your approach. IIUC, you are "just" relying on the OS providing a thread pool, (and the sublime concurrency and synchronization routines are nothing more than that, ISTM). Implementing a thread pool on top of select/poll/kqueue seems straight-forward. Regards, Martin

Trent Nelson

4:51 a.m.

On Thu, Mar 14, 2013 at 03:56:33PM -0700, "Martin v. Löwis" wrote:

...

Am 14.03.13 11:23, schrieb Trent Nelson:

...
Porting the Py_PXCTX part is trivial compared to the work that is going to be required to get this stuff working on POSIX where none of the sublime Windows concurrency, synchronisation and async IO primitives exist.

I couldn't understand from your presentation why this is essential to your approach. IIUC, you are "just" relying on the OS providing a thread pool, (and the sublime concurrency and synchronization routines are nothing more than that, ISTM).

Right, there's nothing Windows* does that can't be achieved on Linux/BSD, it'll just take more scaffolding (i.e. we'll need to manage our own thread pool at the very least). [*]: actually, the interlocked singly-linked list stuff concerns me; the API seems straightforward enough but the implementation becomes deceptively complex once you factor in the ABA problem. (I'm not aware of a portable open source alternative for that stuff.)

...

Implementing a thread pool on top of select/poll/kqueue seems straight-forward.

Nod, that's exactly what I've got in mind. Spin up a bunch of threads that sit there and call poll/kqueue in an endless loop. That'll work just fine for Linux/BSD/OSX. Actually, what's really interesting is the new registered IO facilities in Windows 8/2012. The Microsoft recommendation for achieving the ultimate performance (least amount of jitter, lowest latency, highest throughput) is to do something like this: while (1) { if (!DequeueCompletionRequests(...)) { YieldProcessor(); continue; } else { /* Handle requests */ } } That pattern looks a lot more like what you'd do on Linux/BSD (spin up a thread per CPU and call epoll/kqueue endlessly) than any of the previous Windows IO patterns. Trent.

Antoine Pitrou

12:49 p.m.

On Thu, 14 Mar 2013 16:21:14 -0700 Trent Nelson wrote:

...

Actually, what's really interesting is the new registered IO facilities in Windows 8/2012. The Microsoft recommendation for achieving the ultimate performance (least amount of jitter, lowest latency, highest throughput) is to do something like this:

while (1) {

if (!DequeueCompletionRequests(...)) { YieldProcessor(); continue; } else { /* Handle requests */ } }

Does Microsoft change their recommendations every couple of years? :) Regards Antoine.

"Martin v. Löwis"

6:39 p.m.

Am 15.03.13 00:19, schrieb Antoine Pitrou:

...

Does Microsoft change their recommendations every couple of years? :)

Indeed they do. In fact, it's not really the recommendation that changes, but APIs that are added to new Windows releases. In the specific case, Windows 8 adds an API called "Registered IO" (RIO). They (of course) do these API addition in expecting some gain, and then they (of course) promote these new APIs as actually achieving the gain. In the socket APIs, the Unix world went through a similar evolution, with select, poll, epoll, kqueue, and whatnot. The rate at which they change async APIs is actually low, compared to the rate at which they change relational-database APIs (ODBC, ADO, OLEDB, DAO, ADO.NET, LINQ, ... :-) Regards, Martin

Trent Nelson

12:15 a.m.

On Wed, Mar 13, 2013 at 07:05:41PM -0700, Trent Nelson wrote:

...

Just posted the slides for those that didn't have the benefit of attending the language summit today:

https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-altern...

Someone on /r/python asked if I could elaborate on the "do Y" part of "if we're in a parallel thread, do Y, if not, do X", which I (inadvertently) ended up replying to in detail. I've included the response below. (I'll work on converting this into a TL;DR set of slides soon.)

...

Can you go into a bit of depth about "X" here?

That's a huge topic that I'm hoping to tackle ASAP. The basic premise is that parallel 'Context' objects (well, structs) are allocated for each parallel thread callback. The context persists for the lifetime of the "parallel work". The "lifetime of the parallel work" depends on what you're doing. For a simple ``async.submit_work(foo)``, the context is considered complete once ``foo()`` has been called (presuming no exceptions were raised). For an async client/server, the context will persist for the entirety of the connection. The context is responsible for encapsulating all resources related to the parallel thread. So, it has its own heap, and all memory allocations are taken from that heap. For any given parallel thread, only one context can be executing at a time, and this can be accessed via the ``__declspec(thread) Context *ctx`` global (which is primed by some glue code as soon as the parallel thread starts executing a callback). No reference counting or garbage collection is done during parallel thread execution. Instead, once the context is finished, it is scheduled to be released, which means it'll be "processed" by the main thread as part of its housekeeping work (during ``async.run()`` (technically, ``async.run_once()``). The main thread simply destroys the entire heap in one fell swoop, releasing all memory that was associated with that context. There are a few side effects to this. First, the heap allocator (basically, the thing that answers ``malloc()`` calls) is incredibly simple. It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB on x64), and simply returns pointers to that chunk for each memory request (adjusting h->next and allocation stats as it goes along, obviously). Once the 2MB has been exhausted, another 2MB is allocated. That approach is fine for the ``submit_(work|timer|wait)`` callbacks, which basically provide a way to run a presumably-finite-length function in a parallel thread (and invoking callbacks/errbacks as required). However, it breaks down when dealing with client/server stuff. Each invocation of a callback (say, ``data_received(...)``) may only consume, say, 500 bytes, but it might be called a million times before the connection is terminated. You can't have cumulative memory usage with possibly-infinite-length client/server-callbacks like you can with the once-off ``submit_(work|wait|timer)`` stuff. So, enter heap snapshots. The logic that handles all client/server connections is instrumented such that it takes a snapshot of the heap (and all associated stats) prior to invoking a Python method (via ``PyObject_Call()``, for example, i.e. the invocation of ``data_received``). When the method completes, we can simply roll back the snapshot. The heap's stats and next pointers et al all get reset back to what they were before the callback was invoked. That's how the chargen server is able to pump out endless streams of data for every client whilst keeping memory usage static. (Well, every new client currently consumes at least a minimum of 2MB (but down the track that can be tweaked back down to SMALL_PAGE_SIZE, 4096, for servers that need to handle hundreds of thousands of clients simultaneously). The only issue with this approach is detecting when the callback has done the unthinkable (from a shared-nothing perspective) and persisted some random object it created outside of the parallel context it was created in. That's actually a huge separate technical issue to tackle -- and it applies just as much to the normal ``submit_(wait|work|timer)`` callbacks as well. I've got a somewhat-temporary solution in place for that currently: d = async.dict() def foo(): # async.rdtsc() is a helper method # that basically wraps the result of # the assembly RDTSC (read time- # stamp counter) instruction into a # PyLong object. So, it's handy when # I need to test the very functionality # being demonstrated here (creating # an object within a parallel context # and persisting it elsewhere). d['foo'] = async.rdtsc() def bar(): d['bar'] = async.rdtsc() async.submit_work(foo) async.submit_work(bar) That'll result in two contexts being created, one for each callback invocation. ``async.dict()`` is a "parallel safe" wrapper around a normal PyDict. This is referred to as "protection". In fact, the code above could have been written as follows: d = async.protect(dict()) What ``protect()`` does is instrument the object such that we intercept ``__getitem__``, ``__setitem__``, ``__getattr__`` and ``__setattr__``. We replace these methods with counterparts that serve two purposes: 1. The read-only methods are wrapped in a read-lock, the write methods are wrapped in a write lock (using underlying system slim read/write locks, which are uber fast). (Basically, you can have unlimited readers holding the read lock, but only one writer can hold the write lock (excluding all the readers and other writers).) 2. Detecting when parallel objects (objects created from within a parallel thread, and thus, backed by the parallel context's heap) have been assigned outside the context (in this case, to a "protected" dict object that was created from the main thread). The first point is important as it ensures concurrent access doesn't corrupt the data structure. The second point is important because it allows us to prevent the persisted object's context from automatically transitioning into the complete->release->heapdestroy lifecycle when the callback completes. This is known as "persistence", as in, a context has been persisted. All sorts of things happen to the object when we detect that it's been persisted. The biggest thing is that reference counting is enabled again for the object (from the perspective of the main thread; ref counting is still a no-op within the parallel thread) -- however, once the refcount hits zero, instead of free()ing the memory like we'd normally do in the main thread (or garbage collecting it), we decref the reference count of the owning context. Once the owning context's refcount goes to zero, we know that no more references exist to objects created from that parallel thread's execution, and we're free to release the context (and thus, destroy the heap -> free the memory). That's currently implemented and works very well. There are a few drawbacks: one, the user must only assign to an "async protected" object. Use a normal dict and you're going to segfault or corrupt things (or worse) pretty quickly. Second, we're persisting the entire context potentially for a single object. The context may be huge; think of some data processing callback that ran for ages, racked up a 100MB footprint, but only generated a PyLong with the value 42 at the end, which consumes, like, 50 bytes (or whatever the size of a PyLong is these days). It's crazy keeping a 100MB context around indefinitely until that PyLong object goes away, so, we need another option. The idea I have for that is "promotion". Rather than persist the context, the object is "promoted"; basically, the parallel thread palms it off to the main thread, which proceeds to deep-copy the object, and take over ownership. This removes the need for the context to be persisted. Now, I probably shouldn't have said "deep-copy" there. Promotion is a terrible option for anything other than simple objects (scalars). If you've got a huge list that consumes 98% of your 100MB heap footprint, well, persistence is perfect. If it's a 50 byte scalar, promotion is perfect. (Also, deep-copy implies collection interrogation, which has all sorts of complexities, so, err, I'll probably end up supporting promotion if the object is a scalar that can be shallow-copied. Any form of collection or non-scalar type will get persisted by default.) I haven't implemented promotion yet (persistence works well enough for now). And none of this is integrated into the heap snapshot/rollback logic -- i.e. we don't detect if a client/server callback assigned an object created in the parallel context to a main-thread object -- we just roll back blindly as soon as the callback completes. Before this ever has a chance of being eligible for adoption into CPython, those problems will need to be addressed. As much as I'd like to ignore those corner cases that violate the shared-nothing approach -- it's inevitable someone, somewhere, will be assigning parallel objects outside of the context, maybe for good reason, maybe by accident, maybe because they don't know any better. Whatever the reason, the result shouldn't be corruption. So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness. Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting, LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel thread object? There may be no way to handle that *other* than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or that's deemed as an acceptable tradeoff (a necessary evil, even) to the Python developer community. If it's not, then it's unlikely this approach will ever see the light of day in CPython. If that turns out to be the case, then I see this project taking the path that Stackless took (forking off and becoming a separate interpreter). There's nothing wrong with that; I am really excited about the possibilities afforded by this approach, and I'm sure it will pique the interest of commercial entities out there that have problems perfectly suited to where this pattern excels (shared-nothing, highly concurrent), much like the relationship that developed between Stackless and Eve Online. So, it'd be great if it eventually saw the light of day in CPython, but that'll be a long way down the track (at least 4.x I'd say), and all these issues that allow you to instantly segfault or corrupt the interpreter will need to be addressed before it's even eligible for *discussion* about inclusion. </snip> Regards, Trent.

Trent Nelson

3 a.m.

Cross-referenced to relevant bits of code where appropriate. (And just a quick reminder regarding the code quality disclaimer: I've been hacking away on this stuff relentlessly for a few months; the aim has been to make continual forward progress without getting bogged down in non-value-add busy work. Lots of wildly inconsistent naming conventions and dead code that'll be cleaned up down the track. And the relevance of any given struct will tend to be proportional to how many unused members it has (homeless hoarder + shopping cart analogy).) On Thu, Mar 14, 2013 at 11:45:20AM -0700, Trent Nelson wrote:

...

The basic premise is that parallel 'Context' objects (well, structs) are allocated for each parallel thread callback.

The 'Context' struct: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_priva... Allocated via new_context(): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l42... ....also relevant, new_context_for_socket() (encapsulates a client/server instance within a context). http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l43... Primary role of the context is to isolate the memory management. This is achieved via 'Heap': http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel_priva... (Which I sort of half started refactoring to use the _HEAD_EXTRA approach when I thought I'd need to have a separate heap type for some TLS avenue I explored -- turns out that wasn't necessary).

...

The context persists for the lifetime of the "parallel work".

The "lifetime of the parallel work" depends on what you're doing. For a simple ``async.submit_work(foo)``, the context is considered complete once ``foo()`` has been called (presuming no exceptions were raised).

Managing context lifetime is one of the main responsibilities of async.run_once(): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l38...

...

For an async client/server, the context will persist for the entirety of the connection.

Marking a socket context as 'finished' for servers is the job of PxServerSocket_ClientClosed(): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l68...

...

The context is responsible for encapsulating all resources related to the parallel thread. So, it has its own heap, and all memory allocations are taken from that heap.

The heap is initialized in two steps during new_context(). First, a handle is allocated for the underlying system heap (via HeapCreate): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l42... The first "heap" is then initialized for use with our context via the Heap_Init(Context *c, size_t n, int page_size) call: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l19... Heaps are actually linked together via a doubly-linked list. The first heap is a value member (not a pointer) of Context; however, the active heap is always accessed via the '*h' pointer which is updated as necessary. struct Heap { Heap *prev; Heap *next; void *base; void *next; int allocated; int remaining; ... struct Context { Heap heap; Heap *h; ...

...

For any given parallel thread, only one context can be executing at a time, and this can be accessed via the ``__declspec(thread) Context *ctx`` global (which is primed by some glue code as soon as the parallel thread starts executing a callback).

Glue entry point for all callbacks is _PyParallel_EnteredCallback: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l30... On the topic of callbacks, the main workhorse for the submit_(wait|work) callbacks is _PyParallel_WorkCallback: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l31... The interesting logic starts at start: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l32... The interesting part is the error handling. If the callback raises an exception, we check to see if an errback has been provided. If so, we call the errback with the error details. If the callback completes successfully (or it fails, but the errback completes successfully), that is treated as successful callback or errback completion, respectively: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l32... http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l32... If the errback fails, or no errback was provided, the exception percolates back to the main thread. This is handled at error: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l33... This should make the behavior of async.run_once() clearer. The first thing it does is check to see if any errors have been posted. http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l39... Errors are returned back to calling code on a first-error-wins basis. (This involves fiddling with the context's lifetime, as we're essentially propagating an object created in a parallel context (the (exception, value, traceback) tuple) back to a main thread context -- so, we can't blow away that context until the exception has had a chance to properly bubble back up and be dealt with.) If there are no errors, we then check to see if any "call from main thread" requests have been made: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l39... I added support for this in order to ease unit testing, but it has general usefulness. It's exposed via two decorators: @async.call_from_main_thread def foo(arg): ... def callback(): foo('abcd') async.submit_work(callback) That creates a parallel thread, invokes callback(), which then results in foo(arg) eventually being called from the main thread. This would be useful for synchronising access to a database or something like that. There's also @async.call_from_main_thread_and_wait, which I probably should have mentioned first: @async.call_from_main_thread_and_wait def update_login_details(login, details) db.update(login, details) def foo(): ... update_login_details(x, y) # execution will resume when the main thread finishes # update_login_details() ... async.submit_work(foo) Once all "main thread work requests" have been processed, completed callbacks and errbacks are processed. This basically just involves transitioning the associated context onto the "path to freedom" (the lifecycle that eventually results in the context being free()'d and the heap being destroyed). http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l40...

...

No reference counting or garbage collection is done during parallel thread execution. Instead, once the context is finished, it is scheduled to be released, which means it'll be "processed" by the main thread as part of its housekeeping work (during ``async.run()`` (technically, ``async.run_once()``).

The main thread simply destroys the entire heap in one fell swoop, releasing all memory that was associated with that context.

The "path to freedom" lifecycle is a bit complicated at the moment and could definitely use a review. But, basically, the main methods are _PxState_PurgeContexts() and _PxState_FreeContext(); the former checks that the context is ready to be freed, the latter does the actual freeing. _PxState_PurgeContexts: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l37... _PxState_FreeContext: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l37... The reason for the separation is to maintain bubbling effect; a context only makes one transition per run_once() invocation. Putting this in place was a key step to stop wild crashes in the early days when unittest would keep hold of exceptions longer than I was expecting -- it should probably be reviewed in light of the new persistence support I implemented (much later).

...

There are a few side effects to this. First, the heap allocator (basically, the thing that answers ``malloc()`` calls) is incredibly simple. It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB on x64), and simply returns pointers to that chunk for each memory request (adjusting h->next and allocation stats as it goes along, obviously). Once the 2MB has been exhausted, another 2MB is allocated.

_PyHeap_Malloc is the workhorse here: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l21... Very simple, just keeps nudging along the h->next pointer for each request, allocating another heap when necessary. Nice side effect is that it's ridiculously fast and very cache friendly. Python code running within parallel contexts runs faster than normal main-thread code because of this (plus the boost from not doing any ref counting). The simplicity of this approach made the heap snapshot logic really simple to implement too; taking a snapshot and then rolling back is just a couple of memcpy's and some pointer fiddling.

...

That approach is fine for the ``submit_(work|timer|wait)`` callbacks, which basically provide a way to run a presumably-finite-length function in a parallel thread (and invoking callbacks/errbacks as required).

However, it breaks down when dealing with client/server stuff. Each invocation of a callback (say, ``data_received(...)``) may only consume, say, 500 bytes, but it might be called a million times before the connection is terminated. You can't have cumulative memory usage with possibly-infinite-length client/server-callbacks like you can with the once-off ``submit_(work|wait|timer)`` stuff.

So, enter heap snapshots. The logic that handles all client/server connections is instrumented such that it takes a snapshot of the heap (and all associated stats) prior to invoking a Python method (via ``PyObject_Call()``, for example, i.e. the invocation of ``data_received``).

I came up with the heap snapshot stuff in a really perverse way. The first cut introduced a new 'TLS heap' concept; the idea was that before you'd call PyObject_CallObject(), you'd enable the TLS heap, then roll it back when you were done. i.e. the socket IO loop code had a lot of stuff like this: snapshot = ENABLE_TLS_HEAP(); if (!PyObject_CallObject(...)) { DISABLE_TLS_HEAP_AND_ROLLBACK(snapshot); ... } DISABLE_TLS_HEAP(); ... /* do stuff */ ROLLBACK_TLS_HEAP(snapshot); That was fine initially, until I had to deal with the (pretty common) case of allocating memory from the TLS heap (say, for an async send), and then having the callback picked up by a different thread. That thread then had to return the other thread's snapshot and, well, it just fell apart conceptually. Then it dawned on me to just add the snapshot/rollback stuff to normal Context objects. In retrospect, it's silly I didn't think of this in the first place -- the biggest advantage of the Context abstraction is that it's thread-local, but not bindingly so (as in, it'll only ever run on one thread at a time, but it doesn't matter which one, which is essential, because the ). Once I switched out all the TLS heap cruft for Context-specific heap snapshots, everything "Just Worked". (I haven't removed the TLS heap stuff yet as I'm still using it elsewhere (where it doesn't have the issue above). It's an xxx todo.) The main consumer of this heap snapshot stuff (at the moment) is the socket IO loop logic: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l56... Typical usage now looks like this: snapshot = PxContext_HeapSnapshot(c, NULL); if (!PxSocket_LoadInitialBytes(s)) { PxContext_RollbackHeap(c, &snapshot); PxSocket_EXCEPTION(); } /* at some later point... */ PxContext_RollbackHeap(c, &snapshot);

...

When the method completes, we can simply roll back the snapshot. The heap's stats and next pointers et al all get reset back to what they were before the callback was invoked.

The only issue with this approach is detecting when the callback has done the unthinkable (from a shared-nothing perspective) and persisted some random object it created outside of the parallel context it was created in.

That's actually a huge separate technical issue to tackle -- and it applies just as much to the normal ``submit_(wait|work|timer)`` callbacks as well. I've got a somewhat-temporary solution in place for that currently:

That'll result in two contexts being created, one for each callback invocation. ``async.dict()`` is a "parallel safe" wrapper around a normal PyDict. This is referred to as "protection".

In fact, the code above could have been written as follows:

d = async.protect(dict())

What ``protect()`` does is instrument the object such that we intercept ``__getitem__``, ``__setitem__``, ``__getattr__`` and ``__setattr__``.

The 'protect' details are pretty hairy. _protect does a few checks: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l13... ....and then palms things off to _PyObject_PrepOrigType: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l10... That method is where the magic happens. We basically clone the type object for the object we're protecting, then replace the setitem, getitem etc methods with our counterparts (described next): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l11... Note the voodoo involved in 'protecting' heap objects versus normal C-type objects, GC objects versus non-GC, etc.

...

We replace these methods with counterparts that serve two purposes:

1. The read-only methods are wrapped in a read-lock, the write methods are wrapped in a write lock (using underlying system slim read/write locks, which are uber fast). (Basically, you can have unlimited readers holding the read lock, but only one writer can hold the write lock (excluding all the readers and other writers).)

2. Detecting when parallel objects (objects created from within a parallel thread, and thus, backed by the parallel context's heap) have been assigned outside the context (in this case, to a "protected" dict object that was created from the main thread).

This is handled via _Px_objobjargproc_ass: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l90... That is responsible for detecting when a parallel object is being assigned to a non-parallel object (and tries to persist the object where necessary).

...

The first point is important as it ensures concurrent access doesn't corrupt the data structure.

The second point is important because it allows us to prevent the persisted object's context from automatically transitioning into the complete->release->heapdestroy lifecycle when the callback completes.

This is known as "persistence", as in, a context has been persisted. All sorts of things happen to the object when we detect that it's been persisted. The biggest thing is that reference counting is enabled again for the object (from the perspective of the main thread; ref counting is still a no-op within the parallel thread) -- however, once the refcount hits zero, instead of free()ing the memory like we'd normally do in the main thread (or garbage collecting it), we decref the reference count of the owning context.

That's the job of _Px_TryPersist (called via _Px_objobjargproc_ass as mentioned above): http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l86... That makes use of yet-another-incredibly-useful-Windows-feature called 'init once'; basically, underlying system support for ensuring something only gets done *once*. Perfect for avoiding race conditions.

...

Once the owning context's refcount goes to zero, we know that no more references exist to objects created from that parallel thread's execution, and we're free to release the context (and thus, destroy the heap -> free the memory).

All that magic is the unfortunate reason my lovely Py_INCREF/DECREF overrides when from very simple to quite-a-bit-more-involved. i.e. originally Py_INCREF was just: #define Py_INCREF(o) (Py_PXCTX ? (void)0; Py_REFCNT(o)++); With the advent of parallel object persistence and context-specific refcounts, things become less simple: Py_INCREF: http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l890 890 __inline 891 void 892 _Py_IncRef(PyObject *op) 893 { 894 if ((!Py_PXCTX && (Py_ISPY(op) || Px_PERSISTED(op)))) { 895 _Py_INC_REFTOTAL; 896 (((PyObject*)(op))->ob_refcnt++); 897 } 898 } Py_DECREF: http://hg.python.org/sandbox/trent/file/7148209d5490/Include/object.h#l911 909 __inline 910 void 911 _Py_DecRef(PyObject *op) 912 { 913 if (!Py_PXCTX) { 914 if (Px_PERSISTED(op)) 915 Px_DECREF(op); 916 else if (!Px_ISPX(op)) { 917 _Py_DEC_REFTOTAL; 918 if ((--((PyObject *)(op))->ob_refcnt) != 0) { 919 _Py_CHECK_REFCNT(op); 920 } else 921 _Py_Dealloc((PyObject *)(op)); 922 } 923 } 924 }

...

That's currently implemented and works very well. There are a few drawbacks: one, the user must only assign to an "async protected" object. Use a normal dict and you're going to segfault or corrupt things (or worse) pretty quickly.

Second, we're persisting the entire context potentially for a single object. The context may be huge; think of some data processing callback that ran for ages, racked up a 100MB footprint, but only generated a PyLong with the value 42 at the end, which consumes, like, 50 bytes (or whatever the size of a PyLong is these days).

It's crazy keeping a 100MB context around indefinitely until that PyLong object goes away, so, we need another option. The idea I have for that is "promotion". Rather than persist the context, the object is "promoted"; basically, the parallel thread palms it off to the main thread, which proceeds to deep-copy the object, and take over ownership. This removes the need for the context to be persisted.

Now, I probably shouldn't have said "deep-copy" there. Promotion is a terrible option for anything other than simple objects (scalars). If you've got a huge list that consumes 98% of your 100MB heap footprint, well, persistence is perfect. If it's a 50 byte scalar, promotion is perfect. (Also, deep-copy implies collection interrogation, which has all sorts of complexities, so, err, I'll probably end up supporting promotion if the object is a scalar that can be shallow-copied. Any form of collection or non-scalar type will get persisted by default.)

I haven't implemented promotion yet (persistence works well enough for now). And none of this is integrated into the heap snapshot/rollback logic -- i.e. we don't detect if a client/server callback assigned an object created in the parallel context to a main-thread object -- we just roll back blindly as soon as the callback completes.

Before this ever has a chance of being eligible for adoption into CPython, those problems will need to be addressed. As much as I'd like to ignore those corner cases that violate the shared-nothing approach -- it's inevitable someone, somewhere, will be assigning parallel objects outside of the context, maybe for good reason, maybe by accident, maybe because they don't know any better. Whatever the reason, the result shouldn't be corruption.

So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness.

Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting, LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel thread object?

There may be no way to handle that *other* than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or that's deemed as an acceptable tradeoff (a necessary evil, even) to the Python developer community.

Actually, I'd sort of forgotten that I started adding protection support for lists in _PyObject_PrepOrigType. Well, technically, support for intercepting PySequenceMethods: http://hg.python.org/sandbox/trent/file/7148209d5490/Python/pyparallel.c#l11... I settled for just intercepting PyMappingMethods initially, which is why that chunk of code is commented out. Intercepting the mapping methods allowed me to implement the async protection for dicts and generic objects, which was sufficient for testing purposes at the time. So, er, I guess my point is that automatically detecting object mutation might not be as hard as I'm alluding to above. I'll be happy if we're able to simply raise an exception if you attempt to mutate a non-protected main-thread object. That's infinitely better than segfaulting or silent corruption. Trent.

Trent Nelson

3:19 a.m.

On Thu, Mar 14, 2013 at 02:30:14PM -0700, Trent Nelson wrote:

...

Then it dawned on me to just add the snapshot/rollback stuff to normal Context objects. In retrospect, it's silly I didn't think of this in the first place -- the biggest advantage of the Context abstraction is that it's thread-local, but not bindingly so (as in, it'll only ever run on one thread at a time, but it doesn't matter which one, which is essential, because the ).

Once I switched ...

$10 if you can guess when I took a break for lunch. "....but it doesn't matter which one, which is essential, because there are no guarantees with regards to which thread runs which context." Is along the lines of what I was going to say. Trent.

Christian Tismer

19 Mar 19 Mar

5:57 a.m.

Hi Trent, I just started to try to understand the idea and the implications. Removing almost all of your message since that is already too long to work with: The reference is http://mail.python.org/pipermail/python-dev/2013-March/124690.html On 3/14/13 11:45 AM, Trent Nelson wrote:

...

On Wed, Mar 13, 2013 at 07:05:41PM -0700, Trent Nelson wrote:

...
Just posted the slides for those that didn't have the benefit of attending the language summit today:

https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-altern...

Someone on /r/python asked if I could elaborate on the "do Y" part of "if we're in a parallel thread, do Y, if not, do X", which I (inadvertently) ended up replying to in detail. I've included the response below. (I'll work on converting this into a TL;DR set of slides soon.)

...
Can you go into a bit of depth about "X" here? That's a huge topic that I'm hoping to tackle ASAP. The basic premise is that parallel 'Context' objects (well, structs) are allocated for each parallel thread callback. The context persists for the lifetime of the "parallel work".

...

So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness.

Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting, LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel thread object?

There may be no way to handle that *other* than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or that's deemed as an acceptable tradeoff (a necessary evil, even) to the Python developer community.

This is pretty much my concern: In order to make this waterproof, as required for CPython, you will quite likely have to do something on very many objects, and this is hard to chime into CPython.

...

If it's not, then it's unlikely this approach will ever see the light of day in CPython. If that turns out to be the case, then I see this project taking the path that Stackless took (forking off and becoming a separate interpreter).

We had that discussion quite often for Stackless, and I would love to find a solution that allows to add special versions and use cases to CPython in a way that avoids the forking as we did it. It would be a nice thing if we could come up with a way to keep CPython in place, but to swap the interpreter out and replace it with a specialized version, if the application needs it. I wonder to what extent that would be possible. What I would like to achieve, after having given up on Stackless integration is a way to let it piggyback onto CPython that works like an extension module, although it hat effectively replace larger parts of the interpreter. I wonder if that might be the superior way to have more flexibility, without forcing everything and all go into CPython. If we can make the interpreter somehow pluggable at runtime, a lot of issues would become much simpler.

...

There's nothing wrong with that; I am really excited about the possibilities afforded by this approach, and I'm sure it will pique the interest of commercial entities out there that have problems perfectly suited to where this pattern excels (shared-nothing, highly concurrent), much like the relationship that developed between Stackless and Eve Online.

What do you think: does it make sense to think of a framework that allows to replace the interpreter at runtime, without making normal CPython really slower? cheers - chris -- Christian Tismer :^) mailto:tismer@stackless.com Software Consulting : Have a break! Take a ride on Python's Karl-Liebknecht-Str. 121 : *Starship* http://starship.python.net/ 14482 Potsdam : PGP key -> http://pgp.uni-mainz.de phone +49 173 24 18 776 fax +49 (30) 700143-0023 PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04 whom do you want to sponsor today? http://www.stackless.com/

Trent Nelson

6:56 a.m.

On Mon, Mar 18, 2013 at 05:27:33PM -0700, Christian Tismer wrote:

...

Hi Trent,

Hi Christian! Thanks for taking the time to read my walls of text ;-)

...

...
So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness.

Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting, LOAD_GLOBAL, STORE_GLOBAL, SETITEM etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel thread object?

There may be no way to handle that *other* than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or that's deemed as an acceptable tradeoff (a necessary evil, even) to the Python developer community.

This is pretty much my concern: In order to make this waterproof, as required for CPython, you will quite likely have to do something on very many objects, and this is hard to chime into CPython.

Actually, I think I was unnecessarily pessimistic here. When I sent that follow-up mail with cross-references, I realized I'd forgotten the nitty gritty details of how I implemented the async protection support. It turns out I'd already started on protecting lists (or rather, PySequenceMethods), but decided to stop as the work I'd done on the PyMappingMethods was sufficient for my needs at the time. All I *really* want to do is raise an exception if a parallel object gets assigned to a main-thread container object (list/dict etc) that hasn't been "async protected". (As opposed to now, where it'll either segfault or silently corrupt stuff, then segfault later.) I've already got all the infrastructure in place to test that (I use it extensively within pyparallel.c): Py_ISPY(obj) - detect a main-thread object Py_ISPX(obj) - detect a parallel-thread object Py_IS_PROTECTED(obj) - detect if a main-thread object has been protected* [*]: actually, this isn't in a macro form right now, it's a cheeky inline: __inline char _protected(PyObject *obj) { return (obj->px_flags & Py_PXFLAGS_RWLOCK); } As those macros are exposed in the public , they can be used in other parts of the code base. So, it's just a matter of finding the points where an `lvalue = rvalue` takes place; where: ``Py_ISPY(lvalue) && Py_ISPX(rvalue)``. Then a test to see if lvalue is protected; if not, raise an exception. If so, then nothing else needs to be done. And there aren't that many places where this happens. (It didn't take long to get the PyMappingMethods intercepts nailed down.) That's the idea anyway. I need to get back to coding to see how it all plays out in practice. "And there aren't many places where this happens" might be my famous last words.

...

...
If it's not, then it's unlikely this approach will ever see the light of day in CPython. If that turns out to be the case, then I see this project taking the path that Stackless took (forking off and becoming a separate interpreter).

We had that discussion quite often for Stackless, and I would love to find a solution that allows to add special versions and use cases to CPython in a way that avoids the forking as we did it.

It would be a nice thing if we could come up with a way to keep CPython in place, but to swap the interpreter out and replace it with a specialized version, if the application needs it. I wonder to what extent that would be possible. What I would like to achieve, after having given up on Stackless integration is a way to let it piggyback onto CPython that works like an extension module, although it hat effectively replace larger parts of the interpreter. I wonder if that might be the superior way to have more flexibility, without forcing everything and all go into CPython. If we can make the interpreter somehow pluggable at runtime, a lot of issues would become much simpler.

...
There's nothing wrong with that; I am really excited about the possibilities afforded by this approach, and I'm sure it will pique the interest of commercial entities out there that have problems perfectly suited to where this pattern excels (shared-nothing, highly concurrent), much like the relationship that developed between Stackless and Eve Online.

What do you think: does it make sense to think of a framework that allows to replace the interpreter at runtime, without making normal CPython really slower?

I think there may actually be some interest in what you're suggesting. I had a chat with various other groups over PyCon that had some interesting things in the pipeline, and being able to call back out to CPython internals like you mentioned would be useful to them. I don't want to take my pyparallel work in that direction yet; I'm still hoping I can fix all the show stoppers and have the whole thing eligible for CPython inclusion one day ;-) (So, I'm like you, 10 years ago? :P) Regards, Trent.

4038

Age (days ago)

4060

Last active (days ago)

List overview

Download

27 comments

10 participants

participants (10)

"Martin v. Löwis"
a.cavallo＠cavallinux.eu
Antoine Pitrou
Baptiste Lepilleur
Charles-François Natali
Christian Heimes
Christian Tismer
Stefan Ring
Sturla Molden
Trent Nelson

Slides from today's parallel/async Python talk

a.cavallo＠cavallinux.eu

Sturla Molden

tags

participants (10)