[Python-ideas] Where-statement (Proposal for function expressions)

Steven D'Aprano steve at pearwood.info
Thu Jul 16 19:13:18 CEST 2009


On Fri, 17 Jul 2009 12:51:54 am Chris Perkins wrote:

> Here's another example - how do you calculate sample variance? You're
> probably thinking "Um, well, I think it's the square root of the sum
> of the squares minus the square of the sums".  Great, that's a  
> high-ish level, intuitive description of the algorithm. The precise
> details of how to go about calculating that are not important yet. So
> write it that way:
>
> def variance(data):
>     return sqrt(sum_of_squares - square_of_sums) where:
>         ... elided ...

That would actually be the standard deviation, not the variance.

> There - we've written the core of the algorithm at the same level of
> abstraction as we think about it. The code matches our mental model.
> Only then do we go about filling in the boring details.
> 
> def variance(data):
>     return sqrt(sum_of_squares - square_of_sums) where:
>         def sqrt(v): return v ** 0.5
>         sum_of_squares = sum(v**2 for v in data)
>         square_of_sums = sum(data) ** 2


I don't know what you've calculated, but it's not the variance, or even 
the standard deviation. The (population) variance is actually the mean 
of the squares of the deviations from the mean of the data. Written 
another way, that's the mean of the squares of the data minus the square 
of the mean. (The sample variance is the same quantity rescaled by 
n/(n-1).) But putting all that aside, let's pretend that the formula 
you gave is correct. It's certainly calculating *something* -- let's 
pretend it's the variance.
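
As a minimal sketch of the two equivalent definitions of variance given 
above, next to the quoted expression: the helper names and the sample 
data here are hypothetical, chosen only for illustration, and the code 
assumes Python 3 division.

```python
def mean(data):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(data) / len(data)

def variance_from_deviations(data):
    """Mean of the squared deviations from the mean."""
    m = mean(data)
    return mean([(x - m) ** 2 for x in data])

def variance_from_squares(data):
    """Mean of the squares minus the square of the mean."""
    return mean([x ** 2 for x in data]) - mean(data) ** 2

def quoted_formula(data):
    """The quoted expression, before the square root is taken."""
    return sum(v ** 2 for v in data) - sum(data) ** 2

# Hypothetical sample data, picked so the arithmetic is exact.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance_from_deviations(data))  # 4.0
print(variance_from_squares(data))     # 4.0
print(quoted_formula(data))            # -1368.0
```

The two definitions agree, while the quoted expression is negative for 
this data before any root is taken. The second definition is 
algebraically equivalent to the first but can lose precision in 
floating point when the mean is large relative to the spread.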


I'm now going to try to convince you that this entire approach is 
actually harmful and should be avoided. You said:

"...sum of the squares minus the square of the sums".

That's all very well, but what squares? What sums? You have 
under-specified what the function should do. Is it squares of the first 
fifty prime numbers? Sums of odd-positioned data? You can't just 
pretend that this is an implementation detail -- you have to face up to 
what the function actually does at the design phase.

If you had, you would have noted that the so-called "variance" is the 
sum of the squares *of the data*, minus the square of the sums *of the 
data*. The sums-and-squares are functions of data, not expressions or 
constants, which suggests writing them as functions.

This gives us:

def variance(data):  # for some definition of "variance"
    return sqrt(sum_of_squares(data) - square_of_sums(data)) where:
        ...

This naturally suggests that they are (or could be) reusable functions 
that belong in the module namespace, not repeated inside each function 
that uses them:

def sqrt(v):
    return v**0.5

def sum_of_squares(data):
    return sum(v**2 for v in data)

def square_of_sums(data):
    return sum(data)**2

These can now easily be documented and tested, which is a major win.

Having done this, the "where" statement is redundant:

def variance(data):
    return sqrt(sum_of_squares(data) - square_of_sums(data))


Not just redundant, but actively harmful: it encourages the coder to 
duplicate common code inside functions instead of factoring it out into 
a single external function. This not only violates Don't Repeat 
Yourself, but it makes it harder to test and document the code.
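
To make the testing point concrete, here is a sketch of two of the 
module-level helpers with docstrings that double as doctests. The 
docstring wording and the example inputs are illustrative assumptions; 
the function bodies are unchanged.

```python
def sum_of_squares(data):
    """Return the sum of the squares of the items in data.

    >>> sum_of_squares([1, 2, 3])
    14
    """
    return sum(v**2 for v in data)


def square_of_sums(data):
    """Return the square of the sum of the items in data.

    >>> square_of_sums([1, 2, 3])
    36
    """
    return sum(data)**2
```

Saved in a module, these examples can be checked with 
"python -m doctest" on the file. The same names buried inside a "where" 
block or a nested def are not reachable from outside the enclosing 
function, so they cannot be exercised this way.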



Now of course it's possible to inappropriately inline common code inside 
functions without "where". For instance, I might have written:

def variance(data):
    def sqrt(v):
        return v**0.5
    def sum_of_squares(data):
        return sum(v**2 for v in data)
    def square_of_sums(data):
        return sum(data)**2
    return sqrt(sum_of_squares(data) - square_of_sums(data))

which is a micro-pessimization (not only do I lose the opportunity to 
re-use and test the internal functions, but I also pay the micro-cost 
of re-creating them every time I call the function).
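
The re-creation can be observed directly. In this small sketch (the 
name outer is hypothetical), each call executes the inner def statement 
again and builds a fresh function object, although the compiled body is 
shared between them:

```python
def outer():
    # The def statement below runs on every call to outer(),
    # building a new function object each time.
    def sqrt(v):
        return v ** 0.5
    return sqrt

first = outer()
second = outer()

print(first is second)                    # False: two distinct function objects
print(first.__code__ is second.__code__)  # True: the compiled body is shared
print(first(9.0))                         # 3.0
```

The per-call cost is small, which is why it counts only as a micro-cost, 
but it is paid on every call.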

Or:

def variance(data):
    sum_of_squares = sum(v**2 for v in data)
    square_of_sums = sum(data)**2
    return (sum_of_squares - square_of_sums)**0.5


In either case, writing the sums-and-squares before the return suggests 
that this is common code best written as re-usable functions. 
But "where" changes the focus of the coder away from factoring code 
*out* of functions into placing code *inside* internal blocks. It 
encourages the coder to think of sum_of_squares as a private 
implementation detail instead of a re-usable component. And that is, I 
believe, harmful.



-- 
Steven D'Aprano


