Parallel/Multiprocessing script design question

Amit N nospam at 23342.om
Thu Sep 13 01:20:36 EDT 2007


Hi guys,

I tend to ramble, and I am afraid none of you busy experts will bother 
reading my long post, so I will try to summarize it first:

1. I have a script that processes ~10GB of data daily and runs for a long 
time, and I need to parallelize it on a multi-CPU/multicore system. I am trying 
to decide on a module/toolkit that would help me create a multiprocessing 
solution but there are so many of them that I can't decide what to use. I am 
looking for a cross platform solution, although right now it has to work on 
Windows first, so many of the fork-based modules are out. I am hoping people 
with experience using any of these would chime in with tips. The main thing 
I would look for in a toolkit is maturity and no extra dependencies. Plus a 
wide user community is always good. 
POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

2. The processing involves multiple steps that each input file has to go 
through. I am trying to decide between a batch mode design and a pipelined 
design for concurrency. In the batched design, all files will be processed 
on one processing step (in parallel) before the next step is started. In a 
pipelined design, each file will be taken through all steps to the end. So 
multiple files will be in parallel pipelines at the same time. I can't 
decide which is better. I guess I am asking for experienced eyes to take a 
look at the alternatives, for things that I, making my very first concurrent 
design, won't see.

DETAILS:

I have been trying to choose a design for this project but am stricken by my 
usual case of analysis paralysis.

I had decided to learn Python about 3 weeks ago specifically for this 
project, as it needed parsing and text processing, not realizing that I 
would need concurrency. I am having the same trouble in deciding which 
parser generator to use, but I will ask about parsing in a separate thread 
to keep this focused.

My first version was slow, so I tried to run a multithreaded version, naively expecting a 
2x speedup. I barely got a 5% improvement and only then learned about the 
GIL. I guess I still haven't got too much time invested in this, so I can 
still switch to another language. I am not sure which other scripting 
languages have real multithreading? Perl? But I had chosen Python over Perl 
for readability and maintainability and am not ready to give that up yet. I 
know about Stackless/IronPython/Jython but I want to stick to CPython. So I 
am going to try to figure this out.

Even after deciding to go for an SMP solution, I still don't know which 
toolkit to use. The subprocess module should allow spawning new processes, 
but I am not sure how to get status/error codes back from those? I guess 
this is why people made those parallel processing modules that might help by 
taking care of these things. I think my application is fairly simple and 
should be easy to parallelize on an SMP machine.
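
On the status/error code question, here is a minimal sketch of how the 
subprocess module can report how a child ended. It assumes a current Python 3 
where subprocess.run() is available; run_step is a hypothetical helper name, 
and the child command is only a stand-in for a real filter.

```python
import subprocess
import sys

def run_step(cmd):
    """Run one external command and report how it ended."""
    # capture_output=True collects stdout/stderr instead of printing them;
    # text=True decodes them to str.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr

# Stand-in for a real filter: a child Python that fails with exit status 3.
code, out, err = run_step(
    [sys.executable, "-c", "import sys; sys.stderr.write('bad input'); sys.exit(3)"]
)
print(code, repr(err))
```

A non-zero return code conventionally signals failure, so the caller can 
decide per file whether to retry, skip, or stop.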

THE TASK:

About 800+ 10-15MB files are generated daily that need to be processed. The 
processing consists of different steps that the files must go through:

-Uncompress
-FilterA
-FilterB
-Parse
-Possibly compress parsed files for archival

All files have to be run through each of the two filters. The two filters 
are independent of each other and produce output files that need separate 
parsers. So they can in fact run in parallel, and so can the subsequent 
parsers. Furthermore, multiple files can be running in parallel inside each 
step. E.g., 4 files being uncompressed at the same time. I am using the python 
library for uncompressing and will be doing the parsing in Python too. But 
the two filters are external console programs that I spawn in the system 
shell with subprocess.call(). I guess I can forget about communicating with 
those?
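
Since the two filters are separate programs that write their own output 
files, one possible way to run them side by side for a single input is to 
start both with subprocess.Popen and then wait for each. This is only a 
sketch: run_filters_concurrently is a hypothetical helper, and the commands 
below are stand-ins, not the real filter command lines.

```python
import subprocess
import sys

def run_filters_concurrently(commands):
    """Start every command at once, then wait for all of them.

    Returns a dict mapping each name to that command's exit status.
    """
    # Popen returns immediately, so all children run at the same time.
    running = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    # wait() blocks until that child exits and returns its exit status.
    return {name: proc.wait() for name, proc in running.items()}

# Hypothetical stand-ins for the two console filters.
statuses = run_filters_concurrently({
    "FilterA": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "FilterB": [sys.executable, "-c", "import sys; sys.exit(2)"],
})
print(statuses)
```

The children here do the heavy work in their own processes, so the GIL in 
the parent script does not limit them.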

The first method that came to mind was to finish each step on all files 
before going to the next. So all files are uncompressed first, using 
multiple processes in parallel. Then all files are filtered in parallel, 
etc. I guess I would need some sort of queuing system here, to submit files 
to the CPUs properly?
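
A rough sketch of this batch design, assuming a current Python 3 with the 
standard concurrent.futures module: the worker pool plays the role of the 
queue and hands files to free workers. run_batch and the toy stage functions 
are hypothetical names. The demo call uses threads so that it runs anywhere; 
the default process pool is what pure-Python steps such as parsing would 
want, and on Windows that call has to sit under an 
if __name__ == "__main__": guard.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_batch(items, stages, workers=4, executor_cls=ProcessPoolExecutor):
    """Apply each stage to every item, finishing a stage before the next starts."""
    with executor_cls(max_workers=workers) as pool:
        for stage in stages:
            # map() queues one task per item and gives them to free workers;
            # list() blocks until the whole stage is done (the batch barrier).
            items = list(pool.map(stage, items))
    return items

# Toy stand-ins for the real steps.
def uncompress(name):
    return name[:-3]              # strips a trailing ".gz"

def parse(name):
    return name + ".parsed"

result = run_batch(["a.log.gz", "b.log.gz"], [uncompress, parse],
                   executor_cls=ThreadPoolExecutor)
print(result)
```

With the process pool, each stage function must be defined at module level 
so that it can be pickled and sent to the workers.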

The other way could be to have each individual file run through all the 
steps and have multiple such "pipelines" running simultaneously in parallel. 
It feels like this method will lose cache performance because all the code 
for all the steps will be loaded at the same time, but I am not sure if I 
should be worrying about that. This will have the advantage of "Fast 
First-Out" which means that something waiting for the results of processing 
won't have to wait till the very end. They can start receiving data 
incrementally from the start (kind of streaming?). Pipelined mode may also 
help to rerun an individual file quickly in case it had an error.
So what's the better method?
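
For comparison, a sketch of the pipelined design under the same assumptions 
(current Python 3, standard concurrent.futures; run_pipeline, run_pipelined 
and the toy stages are hypothetical names). Each file is one task that goes 
through all steps, results are handed back as soon as any file finishes, and 
a failure is reported per file so that only that file needs a rerun.

```python
from concurrent.futures import (ProcessPoolExecutor, ThreadPoolExecutor,
                                as_completed)

def run_pipeline(item, stages):
    """Take one item through every stage, in order."""
    for stage in stages:
        item = stage(item)
    return item

def run_pipelined(items, stages, workers=4, executor_cls=ProcessPoolExecutor):
    """Yield (item, result, error) as each item's pipeline finishes."""
    with executor_cls(max_workers=workers) as pool:
        futures = {pool.submit(run_pipeline, item, stages): item
                   for item in items}
        # as_completed() yields futures in finishing order, not submit order.
        for fut in as_completed(futures):
            item = futures[fut]
            try:
                result, error = fut.result(), None
            except Exception as exc:
                # Report the failure for this item only; others continue.
                result, error = None, exc
            yield item, result, error

# Toy stand-ins for the real steps; the second input fails on purpose.
def uncompress(name):
    if not name.endswith(".gz"):
        raise ValueError("not compressed: " + name)
    return name[:-3]

def parse(name):
    return name + ".parsed"

outcomes = {item: (result, error)
            for item, result, error in run_pipelined(
                ["a.log.gz", "b.log"], [uncompress, parse],
                executor_cls=ThreadPoolExecutor)}
print(outcomes)
```

The demo uses threads only so that it runs without the 
if __name__ == "__main__": guard that process pools need on Windows.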

EVALUATIONS:

POSH - Doesn't seem mature, was supposed to be proof of concept only. People 
have reported bugs/problems using it. POSIX only.
delegate/forkmap/pprocess - fork based, POSIX only
ParallelPython - Seems to meet all criteria, and is cross platform. I will 
be trying this one first.
remoteD - Claims to be platform independent, but I don't think it is: the 
code uses os.fork only. Last updated in 2004 (v0.8).
processing - Is in beta (v0.33) but looks promising and is cross platform. 
Uses real processes behind an API that emulates the threading module. 
http://www.python.org/pypi/processing

MPI based modules (probably overkill for my application):

pyPar - Mature, cross platform. Has a dependency on Numeric Python + needs a 
C compiler.
pyMpi - POSIX only. Alpha status. From Lawrence Livermore labs. It modifies 
the interpreter itself to make it multi-noded.
mpi4py - ? Another set of Python bindings for MPI.

LINKS & DISCUSSIONS

http://wiki.python.org/moin/ParallelProcessing
http://blog.ianbicking.org/gil-of-doom.html
http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
http://groups.google.com/group/comp.lang.python/browse_thread/thread/1f5d927d34f8f323/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/332083cdc8bc44b/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/13da24f2d6dc24a9/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/f822ec289f30b26a/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/902dbddfc31b8891
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d8fa9ad770c17c70/




