concurrent file reading/writing using python
Abhishek Pratap
abhishek.vit at gmail.com
Tue Mar 27 02:08:08 EDT 2012
Thanks for the advice, Dennis.
@Steve: I haven't actually written the code yet. I was thinking more on
the generic side and wanted to check whether what I had in mind made
sense, and I now realize it depends on the I/O. For starters I was just
thinking about counting the lines in a file without doing any
computation, so this would be strictly I/O bound.
I guess what I really need to ask is whether we can improve on the
existing disk I/O performance by reading different portions of the file
with separate threads or processes. I am essentially pointing towards a
MapReduce-style task on a file in a shared file system such as GPFS
(from IBM). I realize this is better suited to HDFS, but I wanted to
know whether people have implemented something similar on a normal
Linux-based NFS setup.
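To make this concrete, here is a rough, untested sketch of what I was
imagining for the line-counting case: each worker process seeks to its
own byte range of the file and counts newlines there. The file name and
worker count are just placeholders.

import os
from multiprocessing import Pool

FILENAME = "big_input.txt"   # placeholder path to the big file
NUM_WORKERS = 10             # placeholder number of processes

def count_newlines(byte_range):
    """Count newline bytes in one byte range of FILENAME."""
    offset, length = byte_range
    count = 0
    with open(FILENAME, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(1024 * 1024, remaining))
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    chunk = size // NUM_WORKERS
    # The last worker picks up whatever is left over after even splitting.
    ranges = [(i * chunk, chunk if i < NUM_WORKERS - 1 else size - i * chunk)
              for i in range(NUM_WORKERS)]
    pool = Pool(NUM_WORKERS)
    print(sum(pool.map(count_newlines, ranges)))
    pool.close()
    pool.join()

Whether this would actually beat a single sequential scan presumably
depends on whether GPFS/NFS can serve concurrent reads of one file any
faster than a straight read, which is really what I am asking.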
-Abhi
On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showell30 at yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek.... at gmail.com> wrote:
>> Hi Guys
>>
>> I am forwarding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in Python.
>>
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi-core server, process the data, and write the
>> output to a single file as the data is processed.
>>
>> For example, if I have a 50Gb file, I would like to read it in parallel
>> with 10 processes/threads, each working on a 5Gb chunk, perform the
>> same data-parallel computation on each chunk, and collate the output
>> into a single file.
>>
>> I would appreciate your feedback. I did find some threads about this on
>> stackoverflow, but it was not clear to me what would be a good way to
>> go about implementing it.
>>
>
> Have you written a single-core solution to your problem? If so, can
> you post the code here?
>
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved. All the coordination might make your program slower and
> more buggy.
>
> If CPU is the primary bottleneck, then you might want to consider an
> approach where a single thread reads records from the file, 10 at a
> time, dispatches the calculations out to different threads, and then
> writes the results back to disk.
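>
> Something like this untested sketch is what I have in mind; the file
> names and the compute() body are made up, and you could swap in
> ProcessPoolExecutor if the GIL gets in the way of CPU-bound work:
>
> from concurrent.futures import ThreadPoolExecutor
>
> def compute(record):
>     # stand-in for the real per-record calculation
>     return record.upper()
>
> def batches(lines, size=10):
>     """Yield lists of `size` lines at a time from an iterable of lines."""
>     batch = []
>     for line in lines:
>         batch.append(line)
>         if len(batch) == size:
>             yield batch
>             batch = []
>     if batch:
>         yield batch
>
> with open("input.txt") as src, open("output.txt", "w") as dst:
>     with ThreadPoolExecutor(max_workers=10) as pool:
>         for batch in batches(src):
>             # map() returns results in input order, so writing stays simple
>             for result in pool.map(compute, batch):
>                 dst.write(result)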
>
> My approach would be something like this:
>
> 1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
> 2) Figure out whether you're CPU bound. A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub (see the timing sketch after this list). If you're CPU bound,
> the program will run much faster. If you're I/O bound, it won't run
> much faster, since all the work is really just reading from disk.
> 3) Figure out how to read 10 records at a time and farm out the
> records to threads. Hopefully, your program will take significantly
> less time. At this point, don't obsess over collating data. It might
> not be 10 times as fast, but it should be enough faster to be worth
> your while.
> 4) If the threaded approach shows promise, make sure that you can
> still generate correct output with it (in other words, figure out
> synchronization and collating).
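>
> For step 2, the timing check could be as crude as this (untested;
> "sample.txt" and real_computation() are placeholders for your sample
> file and your actual per-line work):
>
> import time
>
> def real_computation(line):
>     # placeholder for whatever you actually do per line
>     return sum(ord(c) for c in line)
>
> def timed_run(path, compute):
>     """Read every line of `path`, apply compute(), return elapsed seconds."""
>     start = time.time()
>     with open(path) as f:
>         for line in f:
>             compute(line)
>     return time.time() - start
>
> with_cpu = timed_run("sample.txt", real_computation)
> io_only = timed_run("sample.txt", lambda line: None)   # computation stubbed out
> print("with computation: %.2fs   I/O only: %.2fs" % (with_cpu, io_only))
>
> If the two numbers are close, you're I/O bound and extra cores won't
> help much; if the first is much bigger, you're CPU bound.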
>
> At the end of that experiment, you should have a better feel on where
> to go next.
>
> What is the nature of your computation? Maybe it would be easier to
> tune the algorithm than to figure out the multi-core optimization.
>