[Tutor] multiprocessing question
Dave Angel
davea at davea.name
Thu Nov 27 23:55:55 CET 2014
On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
> I made a comparison between multiprocessing and threading. In the
> code below (it's also here: http://pastebin.com/BmbgHtVL),
> multiprocessing is more than 100 (yes: one hundred) times slower than
> threading! That is I-must-be-doing-something-wrong-ishly slow. Any
> idea whether I am doing something wrong? I can't believe the
> difference is so big.
The bulk of the time is spent marshalling the data into the dictionary
self.lookup. You can speed it up some by using a list there (it also
makes the code much simpler). But the real trick is to communicate
less often between the processes:
def mp_create_lookup(self):
    local_lookup = []
    record_start = 0
    for line in self.data:
        if not line:
            break
        local_lookup.append(record_start)
        if len(local_lookup) > 100:
            # Flush a whole batch of offsets in one call, instead of
            # sending them across the boundary one at a time.
            self.lookup.extend(local_lookup)
            local_lookup = []
        record_start += len(line)
    # Flush whatever is left in the final partial batch.
    self.lookup.extend(local_lookup)
It's faster because it passes a larger list across the process
boundary once every 100 records, instead of a single value for every
record.
Note that the return statement was never needed, and you don't need a
lino variable. Just use append.
I still have to emphasize that record_start is just wrong. You must
use tell() if you're planning to use seek() on a text file: in text
mode, seek() only accepts offsets that tell() returned, and a running
total of len(line) counts characters, not bytes.
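A minimal sketch of the safe pattern, assuming a hypothetical
plain-text file 'records.txt' (readline() is used because Python won't
allow tell() while iterating over the file directly):

offsets = []
with open('records.txt') as f:
    while True:
        pos = f.tell()        # an offset seek() will accept later
        if not f.readline():
            break
        offsets.append(pos)

# Jump straight to the start of the last record:
with open('records.txt') as f:
    f.seek(offsets[-1])
    print(f.readline())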
You can also probably speed things up a good deal by passing the
filename to the other process, rather than opening the file in the
original process. That eliminates sharing self.data across the
process boundary entirely.
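A minimal sketch of that idea, with a hypothetical worker function
build_lookup() and file name 'records.txt'; the child process opens
the file itself and ships the finished index back in a single message:

import multiprocessing

def build_lookup(filename, queue):
    # Runs in the child: the parent never touches the file data,
    # so nothing large crosses the process boundary piecemeal.
    offsets = []
    with open(filename) as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    queue.put(offsets)    # one message for the whole index

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=build_lookup,
                                     args=('records.txt', queue))
    worker.start()
    lookup = queue.get()  # fetch before join() to avoid a pipe deadlock
    worker.join()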
--
DaveA