[Python-ideas] minmax() function returning (minimum, maximum) tuple of a sequence

Tal Einat taleinat at gmail.com
Fri Oct 15 17:36:19 CEST 2010


On Fri, Oct 15, 2010 at 5:12 AM, Steven D'Aprano wrote:
> On Thu, 14 Oct 2010 11:05:25 pm you wrote:
>> The use-case I'm targeting is when you can't hold all of the data in
>> memory, and it is relatively "expensive" to generate it, e.g. a large
>> and complex database query. In this case just running the sequential
>> functions one at a time requires generating the data several times,
>> once per function. My goal is to facilitate running several
>> computations on a single iterator without keeping all of the data in
>> memory.
>
> Okay, fair enough, but I think that's enough of a specialist need that
> it doesn't belong as a built-in or even in the standard library.

I don't see this as a specialist need. It is relevant to any piece
of code that receives an iterator and doesn't know whether it is
feasible to keep all of the items in memory. The way I see it,
Python's embrace of iterators is what makes this commonly useful.
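
For illustration, here is roughly the kind of single-pass minmax() I
have in mind (the exact name and error handling are just a sketch):

def minmax(iterable):
    # Consume the iterable exactly once, tracking the smallest and
    # largest items seen so far.
    it = iter(iterable)
    try:
        smallest = biggest = next(it)
    except StopIteration:
        raise ValueError("minmax() arg is an empty sequence")
    for x in it:
        if x < smallest:
            smallest = x
        elif x > biggest:
            biggest = x
    return smallest, biggest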

> I suspect that, even for your application, a more sensible approach
> would be to write a single function to walk over the data once, doing
> all the calculations you need. E.g. if your data is numeric, and you
> need (say) the min, max, mean (average), standard deviation and
> standard error, rather than doing a separate pass for each function,
> you can do them all in a single pass:
>
> import sys
> import math
>
> sum = 0
> sum_sq = 0
> count = 0
> smallest = sys.maxint
> biggest = -sys.maxint
> for x in data:
>    count += 1
>    sum += x
>    sum_sq += x**2
>    smallest = min(smallest, x)
>    biggest = max(biggest, x)
> mean = sum/float(count)
> std_dev = math.sqrt((sum_sq - sum**2/float(count))/(count - 1))
> std_err = std_dev/math.sqrt(count)

What you suggest is that each programmer roll his own code, which is
reasonable for tasks that are uncommon and easy to implement. The
problem is that in this case the straightforward solution has both
efficiency and numerical stability problems, and these are actually
quite tricky to understand and sort out. In light of this, a
standard implementation which avoids the common stumbling blocks and
errors could have its place in the standard library. IIRC these were
among the reasons for the inclusion of the bisect module, for
example.

Regarding the numerical stability issues, these don't arise only in
extreme edge cases. Even a simple running-average calculation over
large numbers, or over numbers whose mean is near zero, can
accumulate significant floating-point error. Variance and standard
deviation are even more problematic in this respect.
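
To make this concrete: the usual remedy is an online algorithm such
as Welford's, which updates the mean and the sum of squared
deviations incrementally rather than accumulating sum and
sum-of-squares (a minimal sketch, not a proposed API):

import math

def running_stats(data):
    # Welford's online algorithm: a numerically stable single pass
    # over the data, yielding mean and sample standard deviation.
    count = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in data:
        count += 1
        delta = x - mean
        mean += delta/count
        m2 += delta*(x - mean)
    std_dev = math.sqrt(m2/(count - 1)) if count > 1 else 0.0
    return mean, std_dev

Unlike the sum/sum_sq version above, this never subtracts two large,
nearly equal quantities, which is where the catastrophic cancellation
comes from.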

> Naturally, if you don't know what functions you need to call until
> runtime it will require a bit more cleverness. A general approach might
> be a functional approach based on reduce:
>
> def multireduce(functions, initial_values, data):
>    values = list(initial_values)
>    for x in data:
>        for i, func in enumerate(functions):
>            values[i] = func(x, values[i])
>    return values
>
> The point is that if generating the data is costly, the best approach is
> to lazily generate the data once only, with the minimal overhead and
> maximum flexibility.
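
For concreteness, that multireduce could be driven like this (the
reducer functions here are just illustrative):

data = iter([3, 1, 4, 1, 5, 9, 2, 6])
smallest, biggest, count = multireduce(
    [min, max, lambda x, acc: acc + 1],
    [float('inf'), float('-inf'), 0],
    data)
# smallest == 1, biggest == 9, count == 8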

This is precisely what I am suggesting! The only difference is that
I suggest using objects with a simple API instead of bare functions,
to allow more flexibility. Some things are hard to implement with
just a binary function as you suggest, and various optimizations
become impossible.
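
By "objects with a simple API" I mean something along these lines
(the names and methods are illustrative only):

class Count(object):
    # A reducer object: feed() consumes one item at a time and
    # result() reports the final value.
    def __init__(self):
        self.n = 0
    def feed(self, item):
        self.n += 1
    def result(self):
        return self.n

class Min(object):
    def __init__(self):
        self.value = None
    def feed(self, item):
        if self.value is None or item < self.value:
            self.value = item
    def result(self):
        return self.value

def apply_reducers(reducers, iterable):
    # Drive all reducers over the data in a single pass.
    for item in iterable:
        for r in reducers:
            r.feed(item)
    return [r.result() for r in reducers]

Unlike a bare binary function, such an object can keep arbitrary
state -- intermediate buffers, a "finished" flag, counters -- which
is what makes the extra flexibility and optimizations possible.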

- Tal Einat


