PyParallel-style threads
Hi all,

There was an experiment based on CPython's code called PyParallel <https://github.com/pyparallel/pyparallel> that allows running threads in parallel without STM, by modifying the source code of both Python and C extensions. The only limitation is that it disallows mutation of global state in a parallel context. I briefly mentioned it before on PyPy's freenode channel. I'd like to discuss why the approach is useful, how it can benefit PyPy users, and how it could be implemented.

Allowing threads to run in parallel as long as they don't mutate global state can help servers use one thread per request. It can also allow logging in parallel, or sending an HTTP request (or an AMQP message) without sharing the response with the main thread. This is useful in some cases, and since PyParallel managed to keep the same semantics, it shouldn't break CPyExt. If we keep to the following rules:

1. No global state mutation is allowed.
2. No new keywords or code modifications are required.
3. No CPyExt code is allowed (for now).

I believe that users can benefit somewhat from this implementation if done correctly.

As for implementation: if we can trace the code running in the thread and ensure that it never mutates global state and never uses CPyExt during the thread's course, we can simply release the GIL whenever such a thread runs. That requires less knowledge than using STM, and fewer code modifications. However, I think that attempting this will introduce the same issue with caching traces (Armin, am I correct here?).

As for CPyExt, we could copy the same code modifications that PyParallel made, but I suspect it would be so slow that the benefit of running in parallel would be completely lost for all but very long-running threads.

Is what I'm suggesting even possible? How challenging would it be?

Thanks,
Omer Katz.
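To make the use case concrete, here is a minimal sketch of the workload shape being described, using only the standard library. This is plain, GIL-bound threading, not PyParallel; the endpoint and the handler are made up for illustration. Each thread only reads the shared CONFIG dict and writes exclusively to its own locals:

    import logging
    import threading
    import urllib.request

    logging.basicConfig(level=logging.INFO)

    # Created by the main thread; the parallel threads only ever read it.
    CONFIG = {"endpoint": "http://example.com/"}

    def handle(request_id):
        # Reads CONFIG but never mutates it or any other main-thread object;
        # everything written here is local to this thread.
        with urllib.request.urlopen(CONFIG["endpoint"]) as resp:
            body = resp.read()
        logging.info("request %s fetched %d bytes", request_id, len(body))

    threads = [threading.Thread(target=handle, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Under the proposed rules, nothing in handle() mutates main-thread state, so the GIL could in principle stay released for the thread's whole duration.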
so quick question - what's the win compared to multiple processes?

On Mon, Jun 20, 2016 at 8:51 AM, Omer Katz <omer.drow@gmail.com> wrote: [...]
Let's review what forking does in Python from a 10,000ft view:

1) It pickles the current state of the process.
2) It starts a new Python process.
3) It unpickles the current state of the process.

There are a lot more memory allocations when forking compared to starting a new thread. That makes forking unsuitable for small workloads. I'm guessing that PyPy does not save the trace/optimized ASM of the forked process in the parent process, so each time you start a new process you have to trace again, which makes small workloads even less suitable, and even large processing batches will need to be traced again. In the case of pre-forking servers, each PyPy instance has to trace and optimize the same code when there is no reason to. Threads would allow us to reduce warmup time for this case. They would also consume less memory.

On Mon, Jun 20, 2016 at 17:47, Maciej Fijalkowski <fijall@gmail.com> wrote: [...]
no, you misunderstood me: if you want to use multiple processes, you're not going to start a new one per task. You'll have a process pool and use that. Also, if you don't use multiprocessing, you don't use pickling; you use something sane for communication. PyParallel essentially allows read-only access to the global state, but read-only is ill-defined and ill-enforced (especially in the case of CPython extensions) in Python. So what do you get as opposed to multiple processes?

On Mon, Jun 20, 2016 at 6:42 PM, Omer Katz <omer.drow@gmail.com> wrote: [...]
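For concreteness, the pool pattern described here might look like the sketch below, using the standard multiprocessing module: the pool is started once and reused across many tasks, so there is no per-task process startup. The work function is a placeholder:

    from multiprocessing import Pool

    def work(item):
        # Placeholder task; stands in for whatever the worker actually does.
        return item * item

    if __name__ == "__main__":
        # The pool of 4 worker processes is created once and reused.
        with Pool(processes=4) as pool:
            results = pool.map(work, range(10))
        print(results)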
Compared to pre-forking (as in the case of a process pool), threads use less memory and reduce the tracing/optimization time on the code, since the same PyPy instance has already traced and optimized that part of the code.

On Mon, Jun 20, 2016 at 20:16, Maciej Fijalkowski <fijall@gmail.com> wrote: [...]
Hi Omer, On 20 June 2016 at 08:51, Omer Katz <omer.drow@gmail.com> wrote:
As for implementation, if we can trace the code running in the thread and ensure it's not mutating global state and that CPyExt is never used during the thread's course we can simply release the GIL when such a thread is run.
That's a very hand-wavy and vague description. To start with, how do you define exactly "not mutating global state"? We are not allowed to write to any of the objects that existed before we started the thread?

It may be possible to have such an implementation, yes. Actually, that's probably easy: tweak the STM code to crash instead of doing something more complicated when we write to an old object.

I'm not sure how useful that would be---or how useful PyParallel is on CPython. Maybe if you can point us to real usages of PyParallel it would be a start.

A bientôt,

Armin.
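To illustrate the distinction being asked about (this is one possible reading of the rule, not PyParallel's exact definition): an object allocated before the thread started is "old", and a write to it from the parallel thread is what the tweaked write barrier would refuse. Under ordinary Python semantics, as in this sketch, the write simply succeeds:

    import threading

    counter = {"hits": 0}   # allocated by the main thread: an "old" object

    def worker():
        local = []                  # allocated inside this thread: writes are fine
        local.append(len(counter))  # *reading* an old object is allowed
        counter["hits"] += 1        # *writing* to an old object is what the
                                    # tweaked write barrier would reject

    t = threading.Thread(target=worker)
    t.start()
    t.join()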
So I actually thought about a similar approach. I was curious what you think about an approach to concurrency similar to what Apple did with C blocks and GCD. That is: enable threading, but instead of the STM approach, have fully explicit mutations within atomic blocks.

2016-06-20 16:53 GMT+02:00 Armin Rigo <arigo@tunes.org>: [...]
--
---------------------------
Michał Domański
I’m not familiar with C blocks and GCD. How would Python code look with that approach?
On 20 Jun 2016, at 6:02 PM, Michał Domański <mdomans@gmail.com> wrote: [...]
C blocks are very similar to Go function literals or ECMAScript 6 arrow functions. I was thinking of either using {} or just having it through indentation with (). So instead of:

    def function():
        def inner():
            # something
        loop.call(inner)

you'd do:

    def function():
        loop.call(
            lambda (arg1, arg2, **kwargs):
                # something
        )

Also, we don't have real CSP - since we have the GIL, all implementations are really just toying with the ideas. The pain point I'm talking about here is this: we introduce a lot of ways to structure code, but they don't give the benefits present in other languages. I'm starting to think we have too much cruft in .py.

2016-06-21 16:17 GMT+02:00 Omer Katz <omer.drow@gmail.com>: [...]
--
---------------------------
Michał Domański
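For comparison, the closest spelling available in today's Python is a thread pool plus closures. The sketch below uses concurrent.futures, with the executor loosely playing the role of a GCD dispatch queue; the names and the trivial closure are purely illustrative:

    from concurrent.futures import ThreadPoolExecutor

    # The executor stands in for a GCD dispatch queue.
    pool = ThreadPoolExecutor(max_workers=4)

    def function(arg1, arg2):
        # Submitting a closure is the nearest existing analogue of
        # dispatching a block onto a queue.
        future = pool.submit(lambda: arg1 + arg2)
        return future.result()

    print(function(1, 2))   # prints 3
    pool.shutdown()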
PyParallel defines "not mutating global state" as *"avoiding mutation of Python objects that were allocated from the main thread; don't append to a main thread list or assign to a main thread dict from a parallel thread"*.

The PyParallel approach provides different tradeoffs from STM. You can't parallelize deserialization of a dictionary into a Python object instance (e.g. a Django model), but you can run a threaded server that performs parallel I/O, since in STM performing I/O makes the transaction inevitable, and there can only be one inevitable transaction at any given point in time, according to the documentation found here: http://doc.pypy.org/en/latest/stm.html#transaction-transactionqueue. Also, I'm not sure how allowing only a single I/O operation at a time in PyPy STM will affect gevent/eventlet or asyncio when more than one thread is involved (which is supported in both gevent and asyncio; I haven't used eventlet, so I don't really know). The PyParallel approach offers the same semantics as CPython when it comes to gevent/asyncio/eventlet: each thread has its own event loop, and you are allowed to switch execution in the middle, since you're not changing anything from other threads.

You can also report errors to Sentry using raven while handling other requests normally. Raven collects stack information, which is never mutated (see https://github.com/getsentry/raven-python/blob/master/raven/utils/stacks.py#...), and then sends it to Sentry's servers. There's no reason (that I can see, at least) to block another request from being processed while collecting that information and sending the data to Sentry's servers.

The use case described by PyParallel is also valid:

"...This is significant when you factor in how Python's scoping works at a language level: Python code executing in a parallel thread can freely access any non-local variables created by the "main thread". That is, it has the exact same scoping and variable name resolution rules as any other Python code. This facilitates loading large data structures from the main thread and then freely accessing them from parallel callbacks.

We demonstrate this with our simple Wikipedia "instant search" server <https://github.com/pyparallel/pyparallel/blob/branches/3.3-px/examples/wiki/wiki.py#L294>, which loads a trie with 27 million entries, each one mapping a title to a 64-bit byte offset within a 60GB XML file. We then load a sorted NumPy array of all 64-bit offsets, which allows us to extract the exact byte range a given title's content appears within the XML file, allowing a client to issue a ranged request for those bytes to get the exact content via a single call to TransmitFile. This call returns immediately, but sets up the necessary structures for the kernel to send that byte range directly to the client without further interaction from us.

The working set size of the python.exe process is about 11GB when the trie and NumPy array are loaded. Thus, multiprocessing would not be feasible, as you'd have 8 separate processes of 11GB if you had 8 cores and started 8 workers, requiring 88GB just for the processes. The number of allocated objects is around 27.1 million; the datrie library can efficiently store values if they're a 32-bit integer, however, our offsets are 64-bit, so an 80-something byte PyObject needs to be allocated to represent each one.

This is significant because it demonstrates the competitive advantage PyParallel has against other languages when dealing with large heap sizes and object counts, whilst simultaneously avoiding the need for continual GC-motivated heap traversal, a product of memory allocation pressure (which is an inevitable side-effect of high-end network load, where incoming links are saturated at line rate)."

STM currently requires code modifications in order to avoid conflicts, at least when collections are involved. PyParallel doesn't allow these kinds of mutations, which makes the implementation much easier in PyPy. PyParallel also requires a specific API to be used in order to utilize its parallel threads. There is a way to eliminate code modifications in PyParallel's case: we initially run with the GIL acquired, as with any other thread, and then check the trace for CPyExt calls or mutations of non-thread-local objects; if there are none, we can eliminate the call to acquire the GIL. Further optimizations can be performed if only a branch of the code requires CPyExt or non-thread-local mutations. I don't know if that is any easier than scanning the trace for lists/sets/dictionaries and replacing them with their equivalent STM implementations, which Armin has already mentioned is not trivial. In the future, when STM is production ready, we can "downgrade" a thread to an STM thread when required, instead of acquiring the GIL and blocking the execution of other threads.

STM also currently makes it harder to reason about how the program behaves, especially when you have conflicts. With my suggestion, you can easily tell whether the GIL is released or not.

On Mon, Jun 20, 2016 at 17:53, Armin Rigo <arigo@tunes.org> wrote: [...]
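The Wikipedia-server pattern in miniature: the sketch below is not PyParallel's actual API, just the shape of it under ordinary threading, with made-up sample data standing in for the 27-million-entry trie. A large structure is built once by the main thread and then only read, never written, from the parallel threads:

    import threading

    # Built once by the main thread; stands in for the title-to-offset trie.
    OFFSETS = {"Python": 1024, "PyPy": 2048}   # made-up sample data

    def serve(title):
        # Only *reads* the big main-thread structure; the local variable and
        # the formatted string are both allocated by this thread.
        offset = OFFSETS.get(title)
        print("%s -> byte offset %s" % (title, offset))

    threads = [threading.Thread(target=serve, args=(t,))
               for t in ("Python", "PyPy")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()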
With all due respect - wouldn't it make more sense to agree on an API that would in fact use threads? The patterns used by Go and Erlang, and to some degree by ObjC and Swift, seem more promising than "hidden multiprocessing".

I may be a bore, but if what I'm getting is just "a nice syntax with restrictions", it's not worth working on. I'd like to see actual benefits for people who want multithreading to work. It may seem like blasphemy, but with PyPy we could agree on new APIs.

2016-06-20 19:53 GMT+02:00 Omer Katz <omer.drow@gmail.com>: [...]
--
---------------------------
Michał Domański
CSP is something you can already implement with Python. In fact, that's exactly what happens when one uses Python threads with coroutines (such as gevent). I'm not sure what you're suggesting or how it would keep Python semantics. The limitations of the GIL will prevent CSP-style concurrency from actually performing as well as Go's. What I'm suggesting would relax the limitations of the GIL without changing semantics or requiring a battle-tested STM implementation for the time being. The use cases I described would benefit from having threads working. Applications like Salt use a lot of threads; if we can run some of them in parallel without changing code, that's a huge win in my book.

On Mon, Jun 20, 2016, 22:56 Michał Domański <mdomans@gmail.com> wrote: [...]
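A minimal example of the CSP style referred to here, built from the standard queue and threading modules; gevent would swap in green threads, but the shape is the same. The channel is just a Queue, and the two processes communicate only through it:

    import queue
    import threading

    channel = queue.Queue()   # plays the role of a CSP channel

    def producer():
        for i in range(3):
            channel.put(i)
        channel.put(None)     # sentinel: tells the consumer to stop

    def consumer():
        while True:
            item = channel.get()
            if item is None:
                break
            print("received", item)

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()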
Participants (4):
- Armin Rigo
- Maciej Fijalkowski
- Michał Domański
- Omer Katz