[Tutor] multiprocessing question

Dave Angel davea at davea.name
Thu Nov 27 23:55:55 CET 2014


On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
> I made a comparison between multiprocessing and threading.  In the code below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more than 100 (yes: one hundred) times slower than threading!  That is I-must-be-doing-something-wrong-ishly slow.  Any idea whether I am doing something wrong?  I can't believe the difference is so big.

The bulk of the time is spent marshalling the data into the shared 
dictionary self.lookup.  You can speed it up some by using a list there 
(it also makes the code much simpler).  But the real trick is to 
communicate less often between the processes.

     def mp_create_lookup(self):
         # Buffer the offsets locally and push them to the shared
         # list in batches, so we cross the process boundary once
         # per hundred records instead of once per record.
         local_lookup = []
         record_start = 0
         for line in self.data:
             if not line:
                 break
             local_lookup.append(record_start)
             if len(local_lookup) > 100:
                 self.lookup.extend(local_lookup)
                 local_lookup = []
             record_start += len(line)
         self.lookup.extend(local_lookup)  # flush the final partial batch

It's faster because it passes a larger list across the boundary every 
100 records, instead of a single value every record.
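
For that extend() batching to pay off, self.lookup has to be a shared 
list.  The real setup is in the pastebin, but a minimal sketch of what 
it presumably looks like (the class name Reader is just illustrative):

     import multiprocessing

     class Reader:
         def __init__(self, data):
             # Assumption: self.lookup is a Manager list proxy, so
             # every append()/extend() is one round-trip message to
             # the manager process -- that is the cost being batched.
             self.manager = multiprocessing.Manager()
             self.lookup = self.manager.list()
             self.data = data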

Note that the return statement was never needed, and you don't need a 
lino variable.  Just use append.

I still have to emphasize that record_start is just wrong.  On a text 
file, len(line) counts decoded characters, not bytes on disk, so the 
accumulated offsets won't match real file positions.  You must use 
tell() if you're planning to use seek() on a text file.
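
For illustration, a minimal sketch of building the table with tell() 
instead (the function and parameter names are made up for the example; 
it uses readline() in a while loop because tell() can't be called while 
iterating a file with for):

     def create_lookup(path):
         # tell() before each readline() gives a position that is
         # guaranteed to be a valid seek() target, even in text mode.
         lookup = []
         with open(path, "r") as f:
             while True:
                 pos = f.tell()
                 if not f.readline():
                     break
                 lookup.append(pos)
         return lookup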

You can also probably speed the process up a good deal by passing the 
filename to the other process, rather than opening the file in the 
original process.  That will eliminate sharing self.data across the 
process boundary.
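
Sketching that idea (the worker function and the "records.txt" filename 
are assumptions for the example, not code from the thread): the child 
receives only the filename, builds the whole table itself, and sends it 
back in one message:

     import multiprocessing

     def worker(filename, queue):
         # Only the filename string crosses the process boundary on
         # the way in; the finished table crosses once on the way out.
         lookup = []
         with open(filename, "r") as f:
             while True:
                 pos = f.tell()
                 if not f.readline():
                     break
                 lookup.append(pos)
         queue.put(lookup)

     if __name__ == "__main__":
         queue = multiprocessing.Queue()
         p = multiprocessing.Process(target=worker,
                                     args=("records.txt", queue))
         p.start()
         lookup = queue.get()   # fetch before join() to avoid blocking
         p.join()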



-- 
DaveA

