[Tutor] multiprocessing question
Albert-Jan Roskam
fomcl at yahoo.com
Fri Nov 28 11:53:17 CET 2014
----- Original Message -----
> From: Dave Angel <davea at davea.name>
> To: tutor at python.org
> Cc:
> Sent: Thursday, November 27, 2014 11:55 PM
> Subject: Re: [Tutor] multiprocessing question
>
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>> I made a comparison between multiprocessing and threading. In the code
>> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
>> than 100 (yes: one hundred) times slower than threading! That is
>> I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
>> something wrong? I can't believe the difference is so big.
>
> The bulk of the time is spent marshalling the data to the dictionary
> self.lookup. You can speed it up some by using a list there (it also
> makes the code much simpler). But the real trick is to communicate less
> often between the processes.
>
> def mp_create_lookup(self):
>     local_lookup = []
>     record_start = 0
>     for line in self.data:
>         if not line:
>             break
>         local_lookup.append(record_start)
>         if len(local_lookup) > 100:
>             self.lookup.extend(local_lookup)
>             local_lookup = []
>         record_start += len(line)
>     print(len(local_lookup))
>     self.lookup.extend(local_lookup)
>
> It's faster because it passes a larger list across the boundary every
> 100 records, instead of a single value every record.
>
> Note that the return statement wasn't ever needed, and you don't need a
> lino variable. Just use append.
>
> I still have to emphasize that record_start is just wrong. You must use
> ftell() if you're planning to use fseek() on a text file.
>
> You can also probably speed the process up a good deal by passing the
> filename to the other process, rather than opening the file in the
> original process. That will eliminate sharing self.data across the
> process boundary.
Hi Dave,
Thanks. I followed your advice and this indeed makes a huuuge difference. Multiprocessing is now just 3 times slower than threading. Even so, threading is still the way to go (also because of the added complexity of the mp_create_lookup function).
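For reference, the worker now does roughly the following (a simplified, untested sketch rather than the exact code in the pastebin; it assumes the offsets go into a multiprocessing.Manager().list(), only the filename is handed to the child process, and "records.txt" stands in for the real data file):

import multiprocessing as mp

def mp_create_lookup(filename, lookup):
    # Build the byte-offset index in the child process. Only the
    # filename crosses the process boundary; offsets are flushed to
    # the shared list in batches to keep the IPC traffic low.
    local_lookup = []
    with open(filename) as f:
        while True:
            offset = f.tell()      # real file position, safe to seek() to
            line = f.readline()
            if not line:
                break
            local_lookup.append(offset)
            if len(local_lookup) >= 100:
                lookup.extend(local_lookup)   # one round trip per batch
                local_lookup = []
    lookup.extend(local_lookup)               # flush the remainder

if __name__ == "__main__":
    manager = mp.Manager()
    lookup = manager.list()
    worker = mp.Process(target=mp_create_lookup,
                        args=("records.txt", lookup))
    worker.start()
    worker.join()

This version also uses f.tell() for the offsets instead of summing len(line), as you pointed out.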
Threading/mp aside: I agree that a dict is not the right choice. I consider a dict to be a mix between a Ferrari and a Mack truck: fast, but bulky. Would it make sense to use array.array instead of a list? I also checked numpy.array, but numpy.append is very inefficient (it reminded me of str.__iadd__). This site suggests that it could make a huge difference in terms of RAM use: http://www.dotnetperls.com/array-python. "The array with 10 million integers required 43.8 MB of memory. The list version required 710.9 MB." (note that it is followed by a word of caution)
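For the record, what I had in mind with array.array is something like this (an untested sketch; the 'q' typecode assumes the byte offsets fit in a signed 64-bit integer, and "records.txt" is again a placeholder):

import array

offsets = array.array('q')          # 8 bytes per entry, no per-item objects
with open("records.txt") as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

print(len(offsets), "offsets,",
      len(offsets) * offsets.itemsize, "bytes of payload")

# Later: f.seek(offsets[i]); record = f.readline()

One caveat: a plain array.array lives in a single process, so for the multiprocessing variant it would either have to be built in one process only or be replaced by something shareable like multiprocessing.Array.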
Albert-Jan