concurrent file reading/writing using python

Steve Howell showell30 at yahoo.com
Mon Mar 26 21:44:30 EDT 2012


On Mar 26, 3:56 pm, Abhishek Pratap <abhishek.... at gmail.com> wrote:
> Hi Guys
>
> I am fwding this question from the python tutor list in the hope of
> reaching more people experienced in concurrent disk access in python.
>
> I am trying to see if there are ways in which I can read a big file
> concurrently on a multi core server and process data and write the
> output to a single file as the data is processed.
>
> For example, if I have a 50 GB file, I would like to read it in
> parallel with 10 processes/threads, each working on a 5 GB chunk,
> perform the same data-parallel computation on each chunk, and
> collate the output to a single file.
>
> I would appreciate your feedback. I did find some threads about this
> on Stack Overflow, but it was not clear to me what would be a good
> way to go about implementing this.
>

Have you written a single-core solution to your problem?  If so, can
you post the code here?

If CPU isn't your primary bottleneck, then you need to be careful not
to overly complicate your solution by getting multiple cores
involved.  All the coordination might make your program slower and
more buggy.

If CPU is the primary bottleneck, then you might want to consider an
approach where a single thread reads records from the file, 10 at a
time, dispatches the calculations out to workers, and then writes the
results back to disk.
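
Here is a minimal sketch of that pattern.  It uses multiprocessing
rather than threads, since CPU-bound pure-Python code won't speed up
with threads because of the GIL; the file names, process_record, and
the batch size of 10 are hypothetical stand-ins:

    import multiprocessing

    def process_record(line):
        # Hypothetical stand-in for the real per-record computation.
        return line.upper()

    def main():
        # The parent process is the single reader/writer; only the
        # per-record computation runs in the worker processes.
        pool = multiprocessing.Pool(processes=10)
        with open("big_input.txt") as infile, \
             open("results.txt", "w") as outfile:
            # imap hands records to workers in batches of 10 and
            # yields results in input order, so collating is automatic.
            for result in pool.imap(process_record, infile, chunksize=10):
                outfile.write(result)
        pool.close()
        pool.join()

    if __name__ == "__main__":
        main()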

My approach would be something like this:

  1) Take a small sample of your dataset so that you can process it
within 10 seconds or so using a simple, single-core program.
  2) Figure out whether you're CPU bound.  A simple way to do this is
to comment out the actual computation or replace it with a trivial
stub.  If you're CPU bound, the stubbed version will run much faster.
If you're IO bound, it won't run much faster (since all the work is
actually just reading from disk).  See the timing sketch after this
list.
  3) Figure out how to read 10 records at a time and farm the records
out to threads (the pool sketch above shows one way).  Hopefully, your
program will take significantly less time.  At this point, don't
obsess over collating the output.  It might not be 10 times as fast,
but it should be enough faster to be worth your while.
  4) If the threaded approach shows promise, make sure that you can
still generate correct output with that approach (in other words,
figure out synchronization and collating).
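
For step 2, a rough timing harness might look like the following.
compute() is a hypothetical stand-in for your real computation, and
sample.txt is the small sample from step 1:

    import time

    def compute(line):
        # Hypothetical stand-in for the real computation.
        return sum(ord(c) for c in line) ** 2

    def compute_stub(line):
        # Trivial stub: same file reading, essentially no CPU work.
        return 0

    def timed_run(func, path):
        start = time.time()
        with open(path) as f:
            for line in f:
                func(line)
        return time.time() - start

    # If the stubbed run is much faster, you're CPU bound; if the
    # two times are close, you're IO bound.
    print("real: %.2fs" % timed_run(compute, "sample.txt"))
    print("stub: %.2fs" % timed_run(compute_stub, "sample.txt"))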

At the end of that experiment, you should have a better feel for where
to go next.

What is the nature of your computation?  Maybe it would be easier to
tune the algorithm than to figure out the multi-core optimization.
