[Tutor] multiprocessing question

Dave Angel davea at davea.name
Thu Nov 27 23:55:55 CET 2014

On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
> I made a comparison between multiprocessing and threading.  In the code below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more than 100 (yes: one hundred) times slower than threading! That is I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing something wrong? I can't believe the difference is so big.

The bulk of the time is spent marshalling the data to the dictionary 
self.lookup.  You can speed it up some by using a list there (it also 
makes the code much simpler).  But the real trick is to communicate less 
often between the processes.

     def mp_create_lookup(self):
         # self.lookup is assumed to be the shared (Manager) list
         local_lookup = []
         record_start = 0
         for line in self.data:
             if not line:         # blank line marks the end of a record
                 local_lookup.append(record_start)
                 if len(local_lookup) > 100:
                     # one trip across the process boundary per 100 records
                     self.lookup.extend(local_lookup)
                     local_lookup = []
             record_start += len(line)
         if local_lookup:         # flush the final partial batch
             self.lookup.extend(local_lookup)

It's faster because it passes a larger list across the boundary every 
100 records, instead of a single value every record.

Note that the return statement wasn't ever needed, and you don't need a 
lino variable.  Just use append.
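To make the dict-vs-list point concrete, here's a minimal sketch (with made-up offsets, not your real data) of why the lino counter disappears once the lookup is a list:

```python
# Dict version: needs an explicit line-number counter as the key.
offsets_dict = {}
for lino, offset in enumerate([0, 6, 11]):   # made-up record offsets
    offsets_dict[lino] = offset

# List version: the index *is* the line number, so append suffices.
offsets_list = []
for offset in [0, 6, 11]:
    offsets_list.append(offset)

assert offsets_list[2] == offsets_dict[2] == 11
```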

I still have to emphasize that record_start is just wrong.  You must use 
ftell() if you're planning to use fseek() on a text file.
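Here's a short sketch of the tell()/seek() pairing on a throwaway file. Summing len(line) breaks as soon as newline translation (Windows \r\n) or a multi-byte encoding makes the character count differ from the file position, which is exactly why tell() is the only safe source of offsets for a later seek() on a text file:

```python
import tempfile

# Create a small sample file to index.
with tempfile.NamedTemporaryFile("w+", delete=False, suffix=".txt") as f:
    f.write("alpha\nbeta\ngamma\n")
    name = f.name

# Record the position *before* each line is read.  Note: readline() is
# used rather than "for line in f", because tell() is disabled while
# iterating over a file object.
offsets = []
with open(name) as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

# Any recorded offset can now be handed back to seek().
with open(name) as f:
    f.seek(offsets[2])
    assert f.readline() == "gamma\n"
```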

You can also probably speed the process up a good deal by passing the 
filename to the other process, rather than opening the file in the 
original process.  That eliminates sharing self.data across the 
process boundary entirely.
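A sketch of that arrangement (build_index and the queue are invented names, not from your code): the parent passes only the filename, the child opens the file itself, and the whole index crosses the boundary in a single put():

```python
from multiprocessing import Process, Queue

def build_index(filename, queue):
    """Child-side worker: opens the file itself, so only the filename
    string ever crosses the process boundary on the way in."""
    offsets = []
    with open(filename) as f:
        while True:
            pos = f.tell()          # portable offset for a later seek()
            if not f.readline():
                break
            offsets.append(pos)
    queue.put(offsets)              # one send for the entire index

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile("w+", delete=False) as f:
        f.write("one\ntwo\n")
        name = f.name
    q = Queue()
    p = Process(target=build_index, args=(name, q))  # filename, not data
    p.start()
    print(q.get())
    p.join()
```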

