[Tutor] multiprocessing question

Dave Angel davea at davea.name
Fri Nov 28 12:36:45 CET 2014


On 11/28/2014 05:53 AM, Albert-Jan Roskam wrote:
>
>
> ----- Original Message -----
>
>> From: Dave Angel <davea at davea.name>
>> To: tutor at python.org
>> Cc:
>> Sent: Thursday, November 27, 2014 11:55 PM
>> Subject: Re: [Tutor] multiprocessing question
>>
>> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>>
>>>   I made a comparison between multiprocessing and threading.  In the code
>>> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
>>> than 100 (yes: one hundred) times slower than threading! That is
>>> I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
>>> something wrong? I can't believe the difference is so big.
>>>
>>
>> The bulk of the time is spent marshalling the data to the dictionary
>> self.lookup.  You can speed it up some by using a list there (it also
>> makes the code much simpler).  But the real trick is to communicate less
>> often between the processes.
>>
>>       def mp_create_lookup(self):
>>           local_lookup = []
>>           record_start = 0
>>           for line in self.data:
>>               if not line:
>>                   break
>>               local_lookup.append(record_start)
>>               if len(local_lookup) > 100:
>>                   self.lookup.extend(local_lookup)
>>                   local_lookup = []
>>               record_start += len(line)
>>           print(len(local_lookup))
>>           self.lookup.extend(local_lookup)
>>
>> It's faster because it passes a larger list across the boundary every
>> 100 records, instead of a single value every record.
>>
>> Note that the return statement wasn't ever needed, and you don't need a
>> lino variable.  Just use append.
>>
>> I still have to emphasize that record_start is just wrong.  You must use
>> ftell() if you're planning to use fseek() on a text file.
>>
>> You can also probably speed the process up a good deal by passing the
>> filename to the other process, rather than opening the file in the
>> original process.  That will eliminate sharing self.data across the
>> process boundary.
>
>
> Hi Dave,
>
> Thanks. I followed your advice and this indeed makes a huuuge difference. Multiprocessing is now just 3 times slower than threading.

And I'd bet you could close most of that gap by opening the file in the 
subprocess instead of marshalling the file I/O across the boundary.
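
Something along these lines is what I have in mind.  Just a sketch, not 
tested against your code; names like build_index and the batch size are 
made up:

    import multiprocessing as mp

    def build_index(filename, queue, batch=1000):
        # Runs in the child process: the file is opened *here*, so only
        # the filename ever crosses the process boundary.
        offsets = []
        with open(filename, "rb") as f:   # binary mode: tell() is a byte offset
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                offsets.append(offset)
                if len(offsets) >= batch:
                    queue.put(offsets)    # ship a whole batch at once
                    offsets = []
        queue.put(offsets)                # whatever is left over
        queue.put(None)                   # sentinel: tell the parent we're done

    def create_lookup(filename):
        queue = mp.Queue()
        worker = mp.Process(target=build_index, args=(filename, queue))
        worker.start()
        lookup = []
        while True:
            chunk = queue.get()
            if chunk is None:
                break
            lookup.extend(chunk)
        worker.join()
        return lookup

(On Windows you'd also want the usual if __name__ == "__main__" guard 
around whatever starts the process.)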

> Even so, threading is still the way to go (also because of the added complexity of the mp_create_lookup function).
>
> Threading/mp aside: I agree that a dict is not the right choice. I consider a dict like a mix between a Ferrari
> and a Mack truck: fast, but bulky. Would it make sense to use array.array instead of list?

Sure.  The first trick for performance is to pick a structure that's 
just complex enough to solve your problem.  Since your keys are 
sequential integers, a list makes more sense than a dict.  If all the 
offsets you store fit in 32 bits (4 GiB or less), then an array.array 
makes sense.  But each time you make such a simplification, you are 
usually adding an assumption.
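
For instance (just a sketch; 'I' is usually a 4-byte unsigned int, so it 
tops out at 4 GiB, and 'Q' would give you 8-byte offsets instead):

    from array import array

    lookup = array("I")    # 4 raw bytes per entry, not a full Python int object
    lookup.append(0)
    lookup.append(1234)
    print(lookup[1])       # indexed by line number, just like a list

Because the values are stored in one contiguous block, a few million 
offsets cost a small fraction of what the equivalent list of int objects 
would.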

I've been treating this as an academic exercise, to help expose some of 
the tradeoffs.  But as you've already pointed out, the real reason to 
use threads is to simplify the code.  The fact that it's faster is just 
gravy.  The main downside to threads is it's way too easy to 
accidentally use a global, and not realize how the threads are interacting.
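
For example, nothing stops two threads from quietly stomping on the same 
module-level name.  A toy illustration (not from your code):

    import threading

    counter = 0                # module-level, shared by every thread

    def work():
        global counter
        for _ in range(100000):
            counter += 1       # read-modify-write, not atomic

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)             # can come out below 400000: updates get lost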

Optimizing is fun:

So are these CSV files pretty stable?  If so, you could prepare an index 
file for each one, and only recalculate it if the timestamp changes.  That 
index could be anything you like, and it could be fixed-length binary 
data, so random access into it is trivial.
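
Roughly like this (a sketch; the .idx suffix, the 'Q' typecode, and the 
build_offsets helper are all placeholders for whatever you end up using):

    import os
    from array import array

    def load_index(csv_path):
        idx_path = csv_path + ".idx"
        if (os.path.exists(idx_path)
                and os.path.getmtime(idx_path) >= os.path.getmtime(csv_path)):
            # The cached index is at least as new as the data: reuse it.
            offsets = array("Q")
            with open(idx_path, "rb") as f:
                offsets.fromfile(f, os.path.getsize(idx_path) // offsets.itemsize)
            return offsets
        # Otherwise rebuild (build_offsets is whatever routine produces the
        # byte offsets) and cache the result for next time.
        offsets = array("Q", build_offsets(csv_path))
        with open(idx_path, "wb") as f:
            offsets.tofile(f)
        return offsets

And since every entry in the index file is the same width, you could even 
seek() straight to one entry in the .idx file without loading the whole 
thing.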

Are the individual lines always less than 256 bytes?  If so, you could 
index every 100th line in a smaller array.array, and for the individual 
line sizes use a bytearray.  You've saved another factor of 4.
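
A sketch of that two-level idea (the 100-line stride and the names are 
just for illustration; it assumes each line, newline included, fits in 
one byte of length):

    from array import array

    STRIDE = 100

    def build_two_level_index(lines):
        checkpoints = array("I")   # byte offset of every 100th line
        sizes = bytearray()        # one byte per line: its length
        offset = 0
        for lineno, line in enumerate(lines):
            if lineno % STRIDE == 0:
                checkpoints.append(offset)
            sizes.append(len(line))    # ValueError if a line is 256+ bytes
            offset += len(line)
        return checkpoints, sizes

    def offset_of(lineno, checkpoints, sizes):
        # Start at the nearest checkpoint, then add the lengths of the
        # lines between it and the one we want.
        block = lineno // STRIDE
        return checkpoints[block] + sum(sizes[block * STRIDE:lineno])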

-- 
DaveA

