[Python-ideas] minmax() function returning (minimum, maximum) tuple of a sequence

Steven D'Aprano steve at pearwood.info
Fri Oct 15 05:12:03 CEST 2010


On Thu, 14 Oct 2010 11:05:25 pm you wrote:
> On Thu, Oct 14, 2010 at 1:23 PM, Steven D'Aprano wrote:
> > On Thu, 14 Oct 2010 08:54:31 am you wrote:
> >> After some thought, I've found a way to make running several
> >> "running calculations" in parallel fast. Speed should be
> >> comparable to having used the non-running variants.
> >
> > Speed "should be" comparable? Are you guessing or have you actually
> > timed it?
> >
> > And surely the point of all this extra coding is to make something
> > run *faster*, not "comparable to", the sequential algorithm?
>
> The use-case I'm targeting is when you can't hold all of the data in
> memory, and it is relatively "expensive" to generate it, e.g. a large
> and complex database query. In this case just running the sequential
> functions one at a time requires generating the data several times,
> once per function. My goal is to facilitate running several
> computations on a single iterator without keeping all of the data in
> memory.

Okay, fair enough, but I think that's enough of a specialist need that 
it doesn't belong as a built-in or even in the standard library.

I suspect that, even for your application, a more sensible approach 
would be to write a single function to walk over the data once, doing 
all the calculations you need. E.g. if your data is numeric, and you 
need (say) the min, max, mean (average), standard deviation and 
standard error, rather than doing a separate pass for each function, 
you can do them all in a single pass:

sum = 0
sum_sq = 0
count = 0
smallest = sys.maxint
biggest = -sys.maxint
for x in data:
    count += 1
    sum += x
    sum_sq += x**2
    smallest = min(smallest, x)
    biggest = max(biggest, x)
mean = sum/count
std_dev = math.sqrt((sum_sq + sum**2)/(count-1))
std_err = std_dev/math.sqrt(count)

That expression for the standard deviation is from memory, don't trust 
it, I've probably got it wrong!

Naturally, if you don't know what functions you need to call until 
runtime it will require a bit more cleverness. A general approach might 
be a functional approach based on reduce:

def multireduce(functions, initial_values, data):
    values = list(initial_values)
    for x in data:
        for i, func in enumerate(functions):
            values[i] = func(x, values[i])
    return values

The point is that if generating the data is costly, the best approach is 
to lazily generate the data once only, with the minimal overhead and 
maximum flexibility.


-- 
Steven D'Aprano



More information about the Python-ideas mailing list