[Python-ideas] minmax() function returning (minimum, maximum) tuple of a sequence

Steven D'Aprano steve at pearwood.info
Thu Oct 14 13:23:31 CEST 2010

On Thu, 14 Oct 2010 08:54:31 am you wrote:

> After some thought, I've found a way to make running several "running
> calculations" in parallel fast. Speed should be comparable to having
> used the non-running variants.

Speed "should be" comparable? Are you guessing or have you actually
timed it?

And surely the point of all this extra coding is to make something run
*faster than*, not merely "comparable to", the sequential algorithm?

> The method is to give each running calculation "blocks" of values
> instead of just one at a time. The apply_in_parallel(iterable,
> block_size=1000, *running_calcs) function would get blocks of values
> from the iterable and pass them to each running calculation
> separately. So RunningMax would look something like this:
>
> class RunningMax(RunningCalc):
>     def __init__(self):
>         self.max_value = None
>
>     def feed(self, value):
>         if self.max_value is None or value > self.max_value:
>             self.max_value = value
>
>     def feedMultiple(self, values):
>         self.feed(max(values))
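For reference, here is a minimal runnable version of how I read that design; the RunningCalc base class and the body of apply_in_parallel are my guesses, since they weren't shown:

```python
from itertools import islice

class RunningCalc:
    def feed(self, value):
        raise NotImplementedError

    def feedMultiple(self, values):
        # Default: fall back to feeding one value at a time.
        for v in values:
            self.feed(v)

class RunningMax(RunningCalc):
    def __init__(self):
        self.max_value = None

    def feed(self, value):
        if self.max_value is None or value > self.max_value:
            self.max_value = value

    def feedMultiple(self, values):
        # Reduce each block with the built-in max, then feed the result.
        self.feed(max(values))

def apply_in_parallel(iterable, block_size=1000, *running_calcs):
    # Read blocks off the iterable and hand each block to every
    # running calculation in turn.
    it = iter(iterable)
    while True:
        block = list(islice(it, block_size))
        if not block:
            break
        for calc in running_calcs:
            calc.feedMultiple(block)
```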

As I understand it, you have a large list of data, and you want to
calculate a number of statistics on it. The naive algorithm does each
calculation sequentially:

a = min(data)
b = max(data)
c = f(data)  # some other statistics
d = g(data)
...
x = z(data)

If the calculations take t1, t2, t3, ..., tx time, then the sequential
calculation takes sum(t1, t2, ..., tx) plus a bit of overhead. If you
do it in parallel, this should reduce the time to max(t1, t2, ..., tx)
plus a bit of overhead, potentially a big saving.
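For what it's worth, genuinely concurrent per-statistic passes would look something like this sketch (the thread pool is my illustration, not part of the proposal; note that under CPython's GIL pure-Python functions won't actually overlap, so even this may not reach max(t1, ..., tx)):

```python
from concurrent.futures import ThreadPoolExecutor

def apply_all(data, funcs):
    # Submit each whole-sequence function as its own task; total
    # wall time is roughly max(t1, ..., tx) only if the functions
    # can actually run concurrently (e.g. they release the GIL).
    with ThreadPoolExecutor(max_workers=len(funcs)) as ex:
        futures = [ex.submit(f, data) for f in funcs]
        return [fut.result() for fut in futures]
```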

But unless I've missed something significant, all you are actually doing
is slicing data up into small pieces, then calling each function min,
max, f, g, ..., z on each piece sequentially:

block = data[:size]
a = min(block)
b = max(block)
c = f(block)
...
block = data[size:2*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...
block = data[2*size:3*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...

Each function still runs sequentially, but you've increased the amount
of overhead a lot: you're slicing and dicing the data, plus calling
each function multiple times.
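Spelled out as code, that blocked scheme is just a sequential fold over blocks; a sketch (the per-function combine step is my own illustration):

```python
def blocked_reduce(data, funcs, size=1000):
    # funcs maps a name to (per-block function, combine function).
    # Every function still sees every block, one after another --
    # nothing here runs in parallel.
    results = {}
    for start in range(0, len(data), size):
        block = data[start:start + size]
        for name, (f, combine) in funcs.items():
            part = f(block)
            if name in results:
                part = combine(results[name], part)
            results[name] = part
    return results
```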

And since each "running calculation" class needs to be hand-written to
suit the specifics of the calculation, that's a lot of extra work just
to get something which I expect will run slower than the naive
sequential algorithm.

I'm also distracted by the names, RunningMax and RunningCalc. RunningFoo
doesn't mean "do Foo in parallel"; it means "return the intermediate
results of Foo". For example, if I ran a function called RunningMax on
this list:

[1, 2, 1, 5, 7, 3, 4, 6, 8, 6]

I would expect it to yield or return:

[1, 2, 5, 7, 8]
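A sketch of that conventional "running" meaning, as a generator:

```python
def running_max(iterable):
    # Yield a value each time a new maximum is seen.
    best = None
    for value in iterable:
        if best is None or value > best:
            best = value
            yield best
```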

--
Steven D'Aprano
