[Tutor] using multiprocessing efficiently to process large data file
Alan Gauld
alan.gauld at btinternet.com
Sun Sep 2 08:41:51 CEST 2012
On 02/09/12 06:48, eryksun wrote:
>
> from multiprocessing import Pool, cpu_count
> from itertools import izip_longest, imap
>
> FILE_IN = '...'
> FILE_OUT = '...'
>
> NLINES = 1000000 # estimate this for a good chunk_size
> BATCH_SIZE = 8
>
> def func(batch):
>     """ test func """
>     import os, time
>     time.sleep(0.001)
>     return "%d: %s\n" % (os.getpid(), repr(batch))
>
> if __name__ == '__main__': # <-- required for Windows
Why?
What difference does that make on Windows? (A sketch of the failure
mode follows below, after the quoted code.)
>     file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
>     nworkers = cpu_count() - 1
>
>     with file_in, file_out:
>         # "grouper" idiom: BATCH_SIZE references to one iterator,
>         # so each tuple is BATCH_SIZE consecutive lines (the last
>         # batch is padded with None)
>         batches = izip_longest(*[file_in] * BATCH_SIZE)
>         if nworkers > 0:
>             pool = Pool(nworkers)
>             chunk_size = NLINES // BATCH_SIZE // nworkers
>             result = pool.imap(func, batches, chunk_size)
>         else:
>             result = imap(func, batches)  # no spare core: run serially
>         file_out.writelines(result)
just curious.
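
For context, the comment refers to the fact that Windows has no
fork(): multiprocessing starts each worker as a fresh interpreter
that re-imports the parent's main module, so any top-level code runs
again in every child. A minimal sketch of the guarded pattern
(assuming Python 2, as in the quoted code; the function name here is
made up):

from multiprocessing import Pool

def square(x):
    return x * x

# On Windows each worker is a brand-new interpreter that re-imports
# this module, so top-level statements run again in every child.
# Without the guard, the Pool(...) call below would re-execute in
# each child and try to spawn workers recursively (multiprocessing
# detects this and raises a RuntimeError).
if __name__ == '__main__':
    # True only in the parent process, so only the parent builds the pool
    pool = Pool(2)
    print pool.map(square, range(10))    # -> [0, 1, 4, 9, ..., 81]

On Unix the workers are forked instead, so the module is not
re-imported and an unguarded script happens to work; the guard keeps
the script portable.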
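
The batches line may also look odd at first sight: it repeats the
same file iterator BATCH_SIZE times, which is the standard "grouper"
idiom; each output tuple advances that single iterator BATCH_SIZE
steps, and the final short batch is padded with None. A quick
illustration (the literal lines are made up):

from itertools import izip_longest

lines = iter(['a\n', 'b\n', 'c\n', 'd\n', 'e\n'])

# three references to the *same* iterator: each tuple that
# izip_longest builds pulls the next three lines, and the last,
# short group is padded with the default fillvalue, None
batches = izip_longest(*[lines] * 3)
print list(batches)
# [('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', None)]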
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/