[Tutor] using multiprocessing efficiently to process large data file
Alan Gauld
alan.gauld at btinternet.com
Fri Aug 31 01:49:19 CEST 2012
On 30/08/12 23:19, Abhishek Pratap wrote:
> I am wondering how can I go about reading data from this at a faster
> pace and then farm out the jobs to worker function using
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
> primarily because I don't know how to read a file in parallel
> efficiently.
Can you show us what you tried? It's always easier to give an answer to
a concrete example than to a hypothetical scenario.
> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.
This is the model I've used. In pseudo code:

chunk = []
for data in file:
    chunk.append(data)
    if len(chunk) == chunksize:    # every chunksize lines...
        launch_subprocess(chunk)   # ...hand the batch off to a worker
        chunk = []
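
Fleshed out a bit, here is a minimal sketch of that approach using
multiprocessing.Pool - the worker function process_chunk() and the
filename 'bigfile.txt' are just placeholders for whatever your real
job and data are:

import multiprocessing as mp

def process_chunk(lines):
    # hypothetical worker - replace with your real per-chunk processing
    return len(lines)

def read_chunks(path, chunksize):
    # read the file sequentially, yielding lists of chunksize lines
    chunk = []
    with open(path) as f:
        for data in f:
            chunk.append(data)
            if len(chunk) == chunksize:
                yield chunk
                chunk = []
    if chunk:                        # don't drop the final partial chunk
        yield chunk

if __name__ == '__main__':
    pool = mp.Pool()                 # one worker per core by default
    results = pool.map(process_chunk, read_chunks('bigfile.txt', 50000))
    pool.close()
    pool.join()
    print(sum(results))

Pool.map() blocks until every chunk has been processed and gives you
the per-chunk results back in order, so you can combine them afterwards
however suits your problem.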
I'd tend to go for big chunks - if you have a million lines in your file
I'd pick a chunksize of around 10,000-100,000 lines. If you go too small,
the overhead of starting the subprocesses will swamp any gains
you get. Also remember the constraints of how many actual CPUs/cores you
have: too many tasks spread over too few CPUs will just cause more
swapping. With anything less than 4 cores it's probably not worth the
effort - just maximise the efficiency of your algorithm, which is
probably worth doing first anyway.
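
As a rough back-of-the-envelope check (the million-line figure and the
50,000-line chunksize below are just the example numbers from above):

import multiprocessing as mp

lines_in_file = 1000000
chunksize = 50000
tasks = lines_in_file // chunksize   # 20 chunks to hand out
workers = mp.cpu_count()             # e.g. on a 4-core box, ~5 chunks per worker
print(tasks, workers)

With numbers like that each worker stays busy for a decent stretch per
chunk, and the per-task startup and data-passing cost is paid 20 times
rather than a million.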
HTH,
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/