[Tutor] using multiprocessing efficiently to process large data file

Alan Gauld alan.gauld at btinternet.com
Fri Aug 31 01:49:19 CEST 2012


On 30/08/12 23:19, Abhishek Pratap wrote:

> I am wondering how can I go about reading data from this at a faster
> pace and then farm out the jobs to worker function using
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
> primarily because I don't know how to read a file in parallel
> efficiently.

Can you show us what you tried? It's always easier to give an answer to 
a concrete example than to a hypothetical scenario.

> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.

This is the model I've used. In pseudo code

chunk = []
for line, data in enumerate(file):
    chunk.append(data)
    if (line + 1) % chunksize == 0:   # chunk is full, hand it off
        launch_subprocess(chunk)
        chunk = []
if chunk:                             # don't forget the final partial chunk
    launch_subprocess(chunk)

I'd tend to go for big chunks - if you have a million lines in your file 
I'd pick a chunksize of around 10,000-100,000 lines. If you go too small 
the overhead of starting the subprocess will swamp any gains
you get. Also remember the constraints of how many actual CPUs/Cores you 
have. Too many tasks spread over too few CPUs will just cause more 
swapping. Anything less than 4 cores is probably not worth the effort - 
just maximise the efficiency of your algorithm, which is probably worth 
doing first anyway.
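
If it helps, here's a minimal, untested sketch of that pattern using a 
multiprocessing.Pool sized to the core count. The file name, chunk size 
and the process_chunk worker are just placeholders for whatever your 
real job needs:

import multiprocessing

CHUNKSIZE = 50000            # lines per chunk - tune for your data

def process_chunk(lines):
    # stand-in worker: replace with your real per-chunk processing
    return len(lines)

def read_chunks(path, chunksize=CHUNKSIZE):
    # read the file sequentially, yielding one chunk of lines at a time
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunksize:
                yield chunk
                chunk = []
    if chunk:                # final partial chunk
        yield chunk

if __name__ == '__main__':
    # one worker per core; more than that mostly adds switching overhead
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for result in pool.imap_unordered(process_chunk,
                                      read_chunks('big_file.txt')):
        print(result)
    pool.close()
    pool.join()

The pool keeps the number of live worker processes bounded, so you can 
feed it as many chunks as the file produces without spawning a process 
per chunk.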

HTH,
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/


