[Python-ideas] minmax() function returning (minimum, maximum) tuple of a sequence

Steven D'Aprano steve at pearwood.info
Thu Oct 14 13:23:31 CEST 2010


On Thu, 14 Oct 2010 08:54:31 am you wrote:

> After some thought, I've found a way to make running several "running
> calculations" in parallel fast. Speed should be comparable to having
> used the non-running variants.

Speed "should be" comparable? Are you guessing or have you actually 
timed it?

And surely the point of all this extra coding is to make something run 
*faster* than the sequential algorithm, not merely "comparable to" it?


> The method is to give each running calculation "blocks" of values
> instead of just one at a time. The apply_in_parallel(iterable,
> block_size=1000, *running_calcs) function would get blocks of values
> from the iterable and pass them to each running calculation
> separately. So RunningMax would look something like this:
>
> class RunningMax(RunningCalc):
>     def __init__(self):
>         self.max_value = None
>
>     def feed(self, value):
>         if self.max_value is None or value > self.max_value:
>             self.max_value = value
>
>     def feedMultiple(self, values):
>         self.feed(max(values))

As I understand it, you have a large list of data, and you want to 
calculate a number of statistics on it. The naive algorithm does each 
calculation sequentially:

a = min(data)
b = max(data)
c = f(data)  # some other statistics
d = g(data)
...
x = z(data)

If the calculations take t1, t2, t3, ..., tx time, then the sequential 
calculation takes sum(t1, t2, ..., tx) plus a bit of overhead. If you 
do it in parallel (and have enough cores to run them all at once), this 
should reduce the time to max(t1, t2, ..., tx) plus a bit of overhead, 
potentially a big saving.
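
For comparison, here is a minimal sketch of what genuinely parallel 
evaluation could look like (my illustration, not the API proposed 
here), using multiprocessing, since CPython's GIL stops CPU-bound 
threads from actually running concurrently:

import multiprocessing

def run_all_parallel(data, funcs):
    # Hand each statistic to its own worker process; with enough free
    # cores the total time approaches max(t1, ..., tx).  Pickling data
    # over to every worker is, of course, its own overhead.
    pool = multiprocessing.Pool()
    try:
        pending = [pool.apply_async(func, (data,)) for func in funcs]
        return [p.get() for p in pending]
    finally:
        pool.close()
        pool.join()

# e.g. a, b = run_all_parallel(data, [min, max])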

But unless I've missed something significant, all you are actually doing 
is slicing the data up into small pieces, then calling each of the 
functions min, max, f, g, ..., z on each piece sequentially:

block = data[:size]
a = min(block)
b = max(block)
c = f(block)
...
block = data[size:2*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...
block = data[2*size:3*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...
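
Written out as a runnable loop, the pattern is this (the helper name 
and the merge step are mine; note that the merge shown only works for 
functions like min and max):

def blockwise(data, size, funcs):
    # Feed successive blocks to every function in turn.  The functions
    # still run one after another, never concurrently.
    results = [None] * len(funcs)
    for start in range(0, len(data), size):
        block = data[start:start + size]
        for i, func in enumerate(funcs):
            part = func(block)
            if results[i] is None:
                results[i] = part
            else:
                # This merge happens to work for min and max, but not
                # for an arbitrary statistic f.
                results[i] = func([results[i], part])
    return results

# blockwise(data, 1000, [min, max]) -> [min(data), max(data)]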

Each function still runs sequentially, but you've increased the amount 
of overhead a lot: you're slicing and dicing the data, plus calling 
each function multiple times.

And since each "running calculation" class needs to be hand-written to 
suit the specifics of the calculation, that's a lot of extra work just 
to get something which I expect will run slower than the naive 
sequential algorithm.


I'm also distracted by the names, RunningMax and RunningCalc. RunningFoo 
doesn't mean "do Foo in parallel"; it means to return the intermediate 
results as you go. For example, if I ran a function called RunningMax 
on this list:

[1, 2, 1, 5, 7, 3, 4, 6, 8, 6]

I would expect it to yield or return:

[1, 2, 5, 7, 8]
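
That is, something like this sketch (my code, nothing to do with the 
proposal):

def running_max(iterable):
    # Yield each value that sets a new maximum.
    current = None
    for value in iterable:
        if current is None or value > current:
            current = value
            yield current

# list(running_max([1, 2, 1, 5, 7, 3, 4, 6, 8, 6]))
# -> [1, 2, 5, 7, 8]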



-- 
Steven D'Aprano


