[Python-ideas] [Python-Dev] minmax() function returning (minimum, maximum) tuple of a sequence
Lie Ryan
lie.1296 at gmail.com
Wed Oct 27 07:19:31 CEST 2010
On 10/26/10 21:10, Steven D'Aprano wrote:
> def multi_reduce(iterable, funcs):
>     it = iter(iterable)
>     collectors = [next(it)]*len(funcs)
>     for i, f in enumerate(funcs):
>         x = next(it)
>         collectors[i] = f(collectors[i], x)
>     return collectors
>
> I've called it multi_reduce rather than parallel_reduce, because it
> doesn't execute the functions in parallel. By my testing on Python
> 3.1.1, multi_reduce is consistently ~30% faster than the generator based
> solution for lists with 1000 - 10,000,000 items.
>
> So what am I missing? What does your parallel_reduce give us that
> multi_reduce doesn't?
The parallel_reduce() is specifically designed for functions with the
signature `func([object])` (a function that takes, as its argument, a
list of objects). The idea is that you can write your func()
iteratively, and parallel_reduce() will somehow handle splitting the
work across multiple funcs, as if you tee()-ed the iterator, but
without caching the whole iterable.
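The caching problem with tee() is easy to see directly: if one consumer drains its copy of the iterator before the other one starts, tee's internal queue has to hold every item in between. A minimal demonstration (my illustration, not from the thread):

```python
from itertools import tee

it = iter(range(1_000_000))
a, b = tee(it)

# max() drains `a` completely before min() ever touches `b`, so tee's
# internal buffer grows to hold all one million items -- effectively
# caching the whole iterable in memory.
hi = max(a)
lo = min(b)
print(hi, lo)  # 999999 0
```

This is exactly the behavior parallel_reduce() is meant to avoid: each consumer should see every item, but the source should only ever be one item "deep".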
Maybe max and min are a bad example, as they happen to have the
alternative signature `func(int, int)`, which makes them a better fit
for the traditional reduce() approach (and, as it happens,
parallel_reduce() seems to be a bad name as well, since it's not
related to the traditional reduce() in any way).
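That alternative two-argument signature is exactly what the traditional reduce() consumes, which is why max and min work so naturally with it:

```python
from functools import reduce

# max's two-argument form, max(a, b), slots straight into reduce(),
# which folds it pairwise over the sequence -- equivalent to calling
# max() on the whole list.
assert reduce(max, [1, 2, 3]) == max([1, 2, 3]) == 3
```

A func that only has the `func([object])` form has no such pairwise decomposition to hand to reduce(), which is the case parallel_reduce() targets.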
And I believe you do miss something:
>>> multi_reduce([1, 2, 3], [max, min])
[2, 1]
>>> parallel_reduce([1, 2, 3], [max, min])
[3, 1]
I don't think that my original attempt with threading is an ideal
solution either; as Guido stated, it's too complicated for such a simple
problem. cProfile even shows that 30% of its time is spent waiting on
acquiring locks. The ideal solution would probably require a way for a
function to interrupt its own execution (when the teed iterator is
exhausted, but there are still items in the true iterable), let other
parts of the code continue (the iterator feeder and the other funcs),
and then resume where it left off (which is why I think cofunction is
probably the way to go, assuming I understand correctly what a
cofunction is).
In diagram form:
Initially, producer creates a few funcs, and feeds them suspendable
teed iterators:
+--------+ +------------+
true iterator +---| func_1 |---| iterator_1 |--[1, ...]
[1, 2, ..] | +--------+ +------------+
| |
+**********+ | +--------+ +------------+
* producer *----+---| func_2 |---| iterator_2 |--[1, ...]
+**********+ | +--------+ +------------+
|
| +--------+ +------------+
+---| func_3 |---| iterator_3 |--[1, ...]
+--------+ +------------+
First, func_1 is executed, and iterator_1 produces item 1:
+********+ +************+
true iterator +---* func_1 *---* iterator_1 *--[*1*, ...]
[*1*, 2, ..] | +********+ +************+
| |
+----------+ | +--------+ +------------+
| producer |----+---| func_2 |---| iterator_2 |--[1, ...]
+----------+ | +--------+ +------------+
|
| +--------+ +------------+
+---| func_3 |---| iterator_3 |--[1, ...]
+--------+ +------------+
then iterator_1 suspends execution, giving control back to producer:
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[...]
[*1*, 2, ..] | +========+ +============+
| |
+**********+ | +--------+ +------------+
* producer *----+---| func_2 |---| iterator_2 |--[1, ...]
+**********+ | +--------+ +------------+
|
| +--------+ +------------+
+---| func_3 |---| iterator_3 |--[1, ...]
+--------+ +------------+
Then, producer gives execution to func_2:
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[...]
[*1*, 2, ..] | +========+ +============+
| |
+----------+ | +********+ +************+
| producer |----+---* func_2 *---* iterator_2 *--[*1*, ...]
+----------+ | +********+ +************+
|
| +--------+ +------------+
+---| func_3 |---| iterator_3 |--[1, ...]
+--------+ +------------+
func_2 processes item 1, then iterator_2 suspends and gives control
back to producer:
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[...]
[*1*, 2, ..] | +========+ +============+
| |
+**********+ | +========+ +============+
* producer *----+---| func_2 |---| iterator_2 |--[...]
+**********+ | +========+ +============+
|
| +--------+ +------------+
+---| func_3 |---| iterator_3 |--[1, ...]
+--------+ +------------+
and now it's func_3's turn:
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[...]
[*1*, 2, ..] | +========+ +============+
| |
+----------+ | +========+ +============+
| producer |----+---| func_2 |---| iterator_2 |--[...]
+----------+ | +========+ +============+
|
| +********+ +************+
+---* func_3 *---* iterator_3 *--[*1*, ...]
+********+ +************+
func_3 processes item 1, then iterator_3 suspends and gives control
back to producer:
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[...]
[*1*, 2, ..] | +========+ +============+
| |
+**********+ | +========+ +============+
* producer *----+---| func_2 |---| iterator_2 |--[...]
+**********+ | +========+ +============+
|
| +========+ +============+
+---| func_3 |---| iterator_3 |--[...]
+========+ +============+
all funcs have consumed item 1, so producer advances (next()s) the
"true iterator" and feeds the new item to the teed iterators.
+========+ +============+
true iterator +---| func_1 |---| iterator_1 |--[2, ...]
[*2*, 3, ..] | +========+ +============+
| |
+**********+ | +========+ +============+
* producer *----+---| func_2 |---| iterator_2 |--[2, ...]
+**********+ | +========+ +============+
|
| +========+ +============+
+---| func_3 |---| iterator_3 |--[2, ...]
+========+ +============+
then producer resumes func_1, and it processes item 2:
+********+ +************+
true iterator +---* func_1 *---* iterator_1 *--[*2*, ...]
[*2*, 3, ..] | +********+ +************+
| |
+----------+ | +========+ +============+
| producer |----+---| func_2 |---| iterator_2 |--[2, ...]
+----------+ | +========+ +============+
|
| +========+ +============+
+---| func_3 |---| iterator_3 |--[2, ...]
+========+ +============+
then the same thing happens to func_2 and func_3; this repeats until
the "true iterator" is exhausted. At that point, producer signals
iterator_1, iterator_2, and iterator_3 so they raise StopIteration,
causing func_1, func_2, and func_3 to return their results. Producer
then collects the results into a list and returns it to its caller.
Basically, it is a form of cooperative multithreading where iterator_XXX
(instead of func_XXX) decides when to suspend the execution of func_XXX
(in this particular case, when its own cache is exhausted but there are
still items in the true iterator).
The advantage is that func_1, func_2, and func_3 can be written
iteratively (i.e. as func([object])), as opposed to the reduce-like
approach. If performance is important, iterator_xxx can feed multiple
items to func_xxx before suspending. Also, it should require no locking,
as object sharing and the suspension of execution are controlled by
iterator_xxx (instead of by nondeterministic preemptive threading).
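Cofunctions never materialized, but the round-robin scheduling in the
diagrams above can be approximated today with plain generators, at the
cost of writing each func as a two-argument op rather than as
func([object]) (a plain function mid-call can't be suspended without
threads or greenlets). A rough single-pass sketch; the names and
structure are my own, not from the thread:

```python
def reducer(op):
    """Build a coroutine that folds `op` over the items sent to it,
    yielding the running result after each item. It plays the role of
    a func_XXX/iterator_XXX pair from the diagrams: the yield is the
    suspension point where control returns to the producer."""
    def gen():
        acc = yield None          # suspend until the first item arrives
        while True:
            # report the running result, then suspend until the
            # producer sends the next item
            item = yield acc
            acc = op(acc, item)
    return gen

def parallel_reduce(iterable, ops):
    """Producer: feeds each item to every consumer in round-robin,
    consuming the true iterator exactly once and caching nothing."""
    gens = [reducer(op)() for op in ops]
    for g in gens:
        next(g)                   # prime each coroutine to its first yield
    results = None
    for item in iterable:
        # each send() resumes one consumer, which processes the item
        # and suspends again -- cooperative, deterministic scheduling
        results = [g.send(item) for g in gens]
    return results

print(parallel_reduce([1, 2, 3], [max, min]))  # [3, 1]
```

Unlike the threaded attempt there are no locks to acquire: control
passes deterministically between producer and consumers at the yield
points, which is exactly the cooperative hand-off the diagrams
describe. (As a sketch it punts on edge cases -- an empty iterable
returns None.)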