[Python-ideas] thoughts on the new 3.4 statistics module

Sat Dec 21 23:29:14 CET 2013

First of all: thank you, Steven and everyone else involved, for taking on
the task of starting to implement this long-missed (at least by me) feature!
I really hope the module will be a success and grow over time.
I have two thoughts at the moment about the implementation that I think may
be worth discussing, if that hasn't happened yet (I have to admit I did not
go through all previous posts on this topic, only read the PEP):

First: I am not entirely convinced by the module's choice of when to raise
errors. In some places it's undoubtedly justified to raise StatisticsError
(like when empty sequences are passed to mean()).
On the other hand, should there really be an error when, for example, no
unique value for the mode can be found?
Effectively, that would force users to guard every (!) call to the function
with try/except. In my opinion, a better choice would be to return
float('nan') or, even better, a module-specific object (call it Undefined or
something) that one can check for. This behavior could, in general, be
implemented for cases where the input can actually be handled and a result
calculated (like a list of values in the mode example), but this result is
considered "undefined" by the algorithm.
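
To make the first suggestion concrete, here is a minimal sketch of what I
have in mind. It is not the module's code: the names Undefined and
mode_or_undefined are made up for illustration, and the counting is done
with collections.Counter rather than with whatever the module uses
internally.

```python
from collections import Counter
from statistics import StatisticsError


class _UndefinedType:
    """Hypothetical sentinel for a result the algorithm considers undefined."""

    def __repr__(self):
        return "Undefined"

    def __bool__(self):
        # Falsy, so "if not result:" also works, although an identity
        # check ("result is Undefined") is the unambiguous test.
        return False


Undefined = _UndefinedType()


def mode_or_undefined(data):
    """Return the single most common value, or Undefined if there is a tie.

    Hypothetical function, not part of the statistics module.
    """
    counts = Counter(data)
    if not counts:
        # Empty input cannot be handled at all, so an error is still justified.
        raise StatisticsError("no mode for empty data")
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        # The input is valid, but no unique mode exists.
        return Undefined
    return top[0][0]


result = mode_or_undefined([1, 1, 2, 2])
if result is Undefined:
    print("no unique mode")
```

The caller then checks the return value once instead of wrapping every call
in try/except, while truly invalid input (an empty sequence) still raises.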

Second: I am not entirely happy with the three different flavors of the
median function. I *do* know that this has been discussed before, but I'm
not sure whether *all* alternatives have been considered (the PEP only
talks about the median.low, median.high syntax, which, in fact, I wouldn't
like that much either). My suggestion would be to have a resolve parameter,
by which the behavior of a single median function can be modified.
My main argument here is that, as the module grows in the future, there
will be many more such situations in which different ways of calculating
statistics are all totally acceptable and you would want to leave the
choice to the user (the mode function can already be considered an
example: maybe the user would want to have the list of "modes" returned in
case no unambiguous value can be calculated; actually, the current code
seems to be prepared for later implementation of this feature because it
does generate the list, it just does not return it). Now if, in all such
situations, the solution is to have extra functions, the module will soon
end up completely cluttered with them. If, on the other hand, every
function that will foreseeably have to handle ambiguous situations had a
resolve parameter, the module structure would be much clearer. In the median
example you would then call median(data) for the default behavior, arguably
the interpolation, but median(data, resolve='low') or median(data,
resolve='high') for the alternative calculations. Statistically educated
users could then guess, relatively easily, which functions have the resolve
parameter, and a quick look at the function's help could tell them which
arguments are accepted.
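
As a rough illustration of the resolve idea, a single median function could
look something like the sketch below. This is only my assumption of how it
might be written, not the module's implementation; in particular, the
default value name 'interpolate' is invented here, and only 'low' and
'high' come from the calls mentioned above.

```python
from statistics import StatisticsError


def median(data, resolve="interpolate"):
    """Median with a hypothetical resolve parameter for even-length data.

    resolve="interpolate": mean of the two middle values (default)
    resolve="low":         the smaller of the two middle values
    resolve="high":        the larger of the two middle values
    """
    if resolve not in ("interpolate", "low", "high"):
        raise ValueError("unknown resolve option: %r" % (resolve,))
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise StatisticsError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the middle one is unambiguous.
        return values[mid]
    low, high = values[mid - 1], values[mid]
    if resolve == "low":
        return low
    if resolve == "high":
        return high
    return (low + high) / 2


print(median([1, 3, 5, 7]))                  # interpolated
print(median([1, 3, 5, 7], resolve="low"))   # lower middle value
print(median([1, 3, 5, 7], resolve="high"))  # upper middle value
```

The same keyword could then be reused by other functions that face an
ambiguous case, instead of adding one extra function per variant.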

Finally, let me just point out that these are really just first thoughts,
and I do understand that these are design decisions about which different
people will have different opinions. Still, I think now is a good time
to discuss them, since with an established and (hopefully :) ) much larger
module you will not be able to change things that easily anymore.

Hoping for a lively discussion,
Wolfgang
