concurrent file reading/writing using python
abhishek.vit at gmail.com
Tue Mar 27 08:08:08 CEST 2012
Thanks for the advice Dennis.
@Steve: I haven't actually written the code. I was thinking more on
the generic side and wanted to check whether what I thought made sense, and
I now realize it can depend on the I/O. For starters I was just
thinking about counting the lines in a file without doing any computation,
so this can be strictly I/O bound.
I guess what I needed to ask was whether we can improve on the existing disk I/O
performance by reading different portions of the file using threads or
processes. I am kind of pointing towards a MapReduce task on a file in
a shared file system such as GPFS (from IBM). I realize this may be
better suited to HDFS, but I wanted to know if people have implemented
something similar on a normal Linux-based NFS.
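
To make the question concrete, here is a minimal sketch of the line-counting case: the file is split into byte ranges and each range is scanned by its own thread. The names `count_newlines` and `parallel_line_count` are made up for illustration, and the choice of threads rather than processes is an assumption that the job really is I/O bound. Whether this beats a single sequential read depends entirely on the storage underneath, so it is something to measure, not a claim that it will be faster.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1 << 20  # read 1 MiB at a time within each byte range


def count_newlines(path, start, end):
    """Count newline bytes in the byte range [start, end) of the file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(BLOCK, remaining))
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, workers=4):
    """Split the file into roughly equal byte ranges and sum the counts."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division, never zero
    starts = list(range(0, size, step))
    ends = [min(s + step, size) for s in starts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(count_newlines, [path] * len(starts), starts, ends)
        return sum(counts)
```

Counting newline bytes does not care where the range boundaries fall, which is why no coordination between workers is needed here. Anything that works on whole records would have to align each range to a line boundary first.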
On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <showell30 at yahoo.com> wrote:
> On Mar 26, 3:56 pm, Abhishek Pratap <abhishek.... at gmail.com> wrote:
>> Hi Guys
>> I am fwding this question from the python tutor list in the hope of
>> reaching more people experienced in concurrent disk access in python.
>> I am trying to see if there are ways in which I can read a big file
>> concurrently on a multi core server and process data and write the
>> output to a single file as the data is processed.
>> For example, if I have a 50Gb file, I would like to read it in parallel
>> with 10 processes/threads, each working on a 5Gb chunk, perform the
>> same data-parallel computation on each chunk of the file, and collate the
>> output to a single file.
>> I will appreciate your feedback. I did find some threads about this on
>> stackoverflow but it was not clear to me what would be a good way to
>> go about implementing this.
> Have you written a single-core solution to your problem? If so, can
> you post the code here?
> If CPU isn't your primary bottleneck, then you need to be careful not
> to overly complicate your solution by getting multiple cores
> involved. All the coordination might make your program slower and
> more buggy.
> If CPU is the primary bottleneck, then you might want to consider an
> approach where you only have a single thread that's reading records
> from the file, 10 at a time, and then dispatching out the calculations
> to different threads, then writing results back to disk.
> My approach would be something like this:
> 1) Take a small sample of your dataset so that you can process it
> within 10 seconds or so using a simple, single-core program.
> 2) Figure out whether you're CPU bound. A simple way to do this is
> to comment out the actual computation or replace it with a trivial
> stub. If you're CPU bound, the program will run much faster. If
> you're IO-bound, the program won't run much faster (since all the work
> is actually just reading from disk).
> 3) Figure out how to read 10 records at a time and farm out the
> records to threads. Hopefully, your program will take significantly
> less time. At this point, don't obsess over collating data. It might
> not be 10 times as fast, but it should be somewhat faster to be worth
> your while.
> 4) If the threaded approach shows promise, make sure that you can
> still generate correct output with that approach (in other words,
> figure out synchronization and collating).
> At the end of that experiment, you should have a better feel on where
> to go next.
> What is the nature of your computation? Maybe it would be easier to
> tune the algorithm than figure out the multi-core optimization.
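
For reference, the single-reader approach described in the quoted steps could be sketched roughly as below: one loop reads the file in batches of 10 records, hands each batch to a worker pool, and writes the results back in input order. The names `batches`, `process_batch` and `run`, the upper-casing stand-in computation, and the `max_pending` limit are all assumptions made for illustration; none of them come from the thread.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, size):
    """Yield lists of up to `size` consecutive lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Hypothetical stand-in for the real per-record computation."""
    return [line.upper() for line in batch]


def run(in_path, out_path, compute=process_batch,
        batch_size=10, workers=4, max_pending=8):
    """Read batches in one thread, compute in a pool, write in input order."""
    with open(in_path) as src, open(out_path, "w") as dst, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()  # futures, oldest first, so output stays ordered
        for batch in batches(src, batch_size):
            pending.append(pool.submit(compute, batch))
            if len(pending) >= max_pending:
                # Block on the oldest batch; this also bounds memory use.
                dst.writelines(pending.popleft().result())
        while pending:
            dst.writelines(pending.popleft().result())
```

Passing a trivial `compute` (one that returns the batch unchanged) and timing the run is one way to do the CPU-bound check from step 2. Note that in CPython, threads generally do not speed up pure-Python CPU-bound work because of the global interpreter lock; `concurrent.futures.ProcessPoolExecutor` offers the same `submit` interface if the computation turns out to be the bottleneck.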