[Tutor] multiprocessing question

Albert-Jan Roskam fomcl at yahoo.com
Fri Nov 28 11:53:17 CET 2014


----- Original Message -----

> From: Dave Angel <davea at davea.name>
> To: tutor at python.org
> Cc: 
> Sent: Thursday, November 27, 2014 11:55 PM
> Subject: Re: [Tutor] multiprocessing question
> 
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>  I made a comparison between multiprocessing and threading. In the code
>> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
>> than 100 (yes: one hundred) times slower than threading! That is
>> I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
>> something wrong? I can't believe the difference is so big.
> 
> The bulk of the time is spent marshalling the data to the dictionary 
> self.lookup.  You can speed it up some by using a list there (it also 
> makes the code much simpler).  But the real trick is to communicate less 
> often between the processes.
> 
>      def mp_create_lookup(self):
>          local_lookup = []
>          record_start = 0
>          for line in self.data:
>              if not line:
>                  break
>              local_lookup.append(record_start)
>              if len(local_lookup) > 100:
>                  # push a whole batch across the process boundary at once
>                  self.lookup.extend(local_lookup)
>                  local_lookup = []
>              record_start += len(line)
>          self.lookup.extend(local_lookup)   # flush whatever is left over
> 
> It's faster because it passes a larger list across the boundary every 
> 100 records, instead of a single value every record.
> 
> Note that the return statement wasn't ever needed, and you don't need a 
> lino variable.  Just use append.
> 
> I still have to emphasize that record_start is just wrong: summing len(line) 
> counts characters, not file positions, so it breaks on a text file with 
> newline translation or a multi-byte encoding. You must record offsets with 
> ftell() if you're planning to use fseek() on a text file.
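> 
> A rough sketch of what I mean (the open() call and fn are only 
> illustrative, not taken from your code):
> 
>      lookup = []
>      with open(fn) as f:            # fn is whatever file you index
>          while True:
>              pos = f.tell()         # an offset that seek() will accept
>              line = f.readline()
>              if not line:
>                  break
>              lookup.append(pos)
>      # later: f.seek(lookup[n]); f.readline() gives you record n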
> 
> You can also probably speed the process up a good deal by passing the 
> filename to the other process, rather than opening the file in the 
> original process. That will eliminate sharing self.data across the 
> process boundary.
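> 
> Something along these lines, say (a sketch only; build_lookup and 
> "data.txt" are placeholders, not names from your program):
> 
>      from multiprocessing import Process, Queue
> 
>      def build_lookup(filename, queue):
>          # the child opens the file itself, so only the finished
>          # offset list ever crosses the process boundary
>          offsets = []
>          with open(filename, "rb") as f:
>              while True:
>                  pos = f.tell()
>                  if not f.readline():
>                      break
>                  offsets.append(pos)
>          queue.put(offsets)
> 
>      if __name__ == "__main__":       # needed where processes are spawned
>          q = Queue()
>          p = Process(target=build_lookup, args=("data.txt", q))
>          p.start()
>          lookup = q.get()             # get before join to avoid a deadlock
>          p.join()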


Hi Dave,

Thanks. I followed your advice and it indeed makes a huge difference. Multiprocessing is now just 3 times slower than threading. Even so, threading still seems the way to go here (also because of the added complexity of the mp_create_lookup function).

Threading/mp aside: I agree that a dict is not the right choice. I think of a dict as a cross between a Ferrari and a Mack truck: fast, but bulky. Would it make sense to use array.array instead of a list? I also checked numpy.array, but numpy.append is very inefficient (it reminded me of str.__iadd__). This site suggests that array.array could make a huge difference in terms of RAM use: http://www.dotnetperls.com/array-python. "The array with 10 million integers required 43.8 MB of memory. The list version required 710.9 MB." (Note that it is followed by a word of caution.)
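For the offsets, something like this is what I have in mind (a quick sketch; the exact sizes will vary by platform and Python version):

    import array
    import sys

    n = 10 ** 6
    as_list = list(range(n))                # one Python int object per offset
    as_array = array.array('q', range(n))   # packed 8-byte signed ints

    print(sys.getsizeof(as_list))    # counts only the list's pointer table
    print(sys.getsizeof(as_array))   # counts the packed buffer itself

    # appending works the same way as with a list:
    as_array.append(123456)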

Albert-Jan

