[Tutor] using multiprocessing efficiently to process large data file
Alan Gauld
alan.gauld at btinternet.com
Sun Sep 2 08:41:51 CEST 2012
On 02/09/12 06:48, eryksun wrote:
>
> from multiprocessing import Pool, cpu_count
> from itertools import izip_longest, imap
>
> FILE_IN = '...'
> FILE_OUT = '...'
>
> NLINES = 1000000 # estimate this for a good chunk_size
> BATCH_SIZE = 8
>
> def func(batch):
>     """ test func """
>     import os, time
>     time.sleep(0.001)
>     return "%d: %s\n" % (os.getpid(), repr(batch))
>
> if __name__ == '__main__': # <-- required for Windows
Why?
What difference does that make on Windows? (A sketch of the failure
mode follows below, after the quoted code.)
>     file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
>     nworkers = cpu_count() - 1
>
>     with file_in, file_out:
>         # "grouper" idiom: BATCH_SIZE references to one iterator,
>         # so each tuple is BATCH_SIZE consecutive lines (the last
>         # batch is padded with None)
>         batches = izip_longest(*[file_in] * BATCH_SIZE)
>         if nworkers > 0:
>             pool = Pool(nworkers)
>             chunk_size = NLINES // BATCH_SIZE // nworkers
>             result = pool.imap(func, batches, chunk_size)
>         else:
>             result = imap(func, batches)  # no spare core: run serially
>         file_out.writelines(result)
just curious.
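
For context, the comment refers to the fact that Windows has no
fork(): multiprocessing starts each worker as a fresh interpreter
that re-imports the parent's main module, so any top-level code runs
again in every child. A minimal sketch of the guarded pattern
(assuming Python 2, as in the quoted code; the function name here is
made up):

from multiprocessing import Pool

def square(x):
    return x * x

# On Windows each worker is a brand-new interpreter that re-imports
# this module, so top-level statements run again in every child.
# Without the guard, the Pool(...) call below would re-execute in
# each child and try to spawn workers recursively (multiprocessing
# detects this and raises a RuntimeError).
if __name__ == '__main__':
    # True only in the parent process, so only the parent builds the pool
    pool = Pool(2)
    print pool.map(square, range(10))    # -> [0, 1, 4, 9, ..., 81]

On Unix the workers are forked instead, so the module is not
re-imported and an unguarded script happens to work; the guard keeps
the script portable.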
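
The batches line may also look odd at first sight: it repeats the
same file iterator BATCH_SIZE times, which is the standard "grouper"
idiom; each output tuple advances that single iterator BATCH_SIZE
steps, and the final short batch is padded with None. A quick
illustration (the literal lines are made up):

from itertools import izip_longest

lines = iter(['a\n', 'b\n', 'c\n', 'd\n', 'e\n'])

# three references to the *same* iterator: each tuple that
# izip_longest builds pulls the next three lines, and the last,
# short group is padded with the default fillvalue, None
batches = izip_longest(*[lines] * 3)
print list(batches)
# [('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', None)]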
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/