
On Wed, Jun 24, 2015 at 04:55:31PM -0700, Nathaniel Smith wrote:
> On Wed, Jun 24, 2015 at 3:10 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
> > So there are two reasons I can think of to use threads for CPU parallelism:
> >
> >  - My thing does a lot of parallel work, and so I want to save on
> >    memory by sharing an address space
> >
> > This only becomes an especially pressing concern if you start running
> > tens of thousands or more of workers. Fork also allows this.
>
> Not necessarily true... e.g., see two threads from yesterday (!) on the
> pandas mailing list, from users who want to perform queries against a
> large data structure shared between threads/processes:
>
>   https://groups.google.com/d/msg/pydata/Emkkk9S9rUk/eh0nfiGR7O0J
>   https://groups.google.com/forum/#!topic/pydata/wOwe21I65-I
>   ("Are we just screwed on windows?")
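Devin's "Fork also allows this" and the Windows question in the second link come down to the same mechanism. A minimal stdlib sketch of it (names here are illustrative, not from either thread): on POSIX, fork gives each worker the parent's address space via copy-on-write, which is exactly what Windows lacks.

```python
# Sketch of fork-based sharing (illustrative names). On POSIX, fork()
# gives each worker the parent's address space via copy-on-write pages,
# so a large read-only structure is never copied up front. Windows has
# no fork: the "spawn" start method must re-import and re-pickle state
# into each worker, which is why this pattern is painful there.
import multiprocessing as mp

BIG_TABLE = {i: i * i for i in range(100_000)}  # built before forking

def query(key):
    # Forked workers read the inherited BIG_TABLE; nothing is serialized.
    return BIG_TABLE[key]

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # POSIX-only start method
    with ctx.Pool(4) as pool:
        print(pool.map(query, [10, 20, 30]))  # [100, 400, 900]
```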
Ironically (not knowing anything about Pandas' implementation details other than... "Cython... and NumPy"), there should be no difference between making a Pandas DataFrame available to PyParallel and doing the same with a NumPy ndarray or a Cythonized C struct (like datrie). The situation Ryan describes is literally the exact situation PyParallel excels at: large reference data structures accessible from parallel contexts.

Trent.
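For the shape of that access pattern in stock CPython (this is not PyParallel; plain threads here only truly overlap for I/O or for GIL-releasing C code such as NumPy's inner loops), a hedged stdlib-only sketch: build one large read-only reference structure once, then query it from many threads with zero per-worker copies.

```python
# Stand-in for a big DataFrame/ndarray/trie: one large read-only
# reference structure, built once and never mutated, queried from
# several threads that all share the same address space.
from concurrent.futures import ThreadPoolExecutor

REFERENCE = {f"key-{i}": i for i in range(100_000)}

def lookup(key):
    # Every thread reads the same object in place; no copies are made.
    return REFERENCE[key]

with ThreadPoolExecutor(max_workers=8) as pool:
    hits = list(pool.map(lookup, ["key-1", "key-42", "key-99999"]))
```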