On Thu, Jan 30, 2014 at 08:58:20PM -0800, Larry Hastings wrote:
On 01/30/2014 05:27 PM, Steven D'Aprano wrote:
I'm hesitant to require two passes over the data in _sum. Some higher-order statistics like variance are currently implemented using two passes, but ultimately I'd like to support single-pass algorithms that can operate on large but finite iterators.
But I will consider it as an option.
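(For the record, a single-pass variance in the style of Welford's algorithm would look roughly like the sketch below. This is only an illustration of the idea, not the module's implementation.)

    def single_pass_variance(iterable):
        # Welford-style update: maintain a running mean and the sum of
        # squared deviations, so the data is consumed exactly once and
        # never has to be stored or re-read.
        n = 0
        mean = 0.0
        m2 = 0.0
        for x in iterable:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n < 2:
            raise ValueError("variance requires at least two data points")
        return m2 / (n - 1)   # sample variance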
I'm also hesitant to make the promise that _sum will be order-independent. Addition in Python isn't: [...]
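For example (an illustration of the point, not taken from the elided text), the grouping of float additions changes the result:

    >>> (0.1 + 0.2) + 0.3
    0.6000000000000001
    >>> 0.1 + (0.2 + 0.3)
    0.6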
I concede that this is mostly outside my expertise, and the statistics module and the PEP were your doing. So you're the expert here and I will defer to you.
But. My dim understanding of the *whole point* of the new statistics module was that it valued correctness over raw performance. I assumed sorting values from small to large before summing was *exactly* the sort of thing it was written to do. If all we wanted were Python's existing semantics, why bother writing statistics._sum() in the first place? Just use sum().
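To make the ordering point concrete (an illustration, not from the original message): small values added one at a time to an already-large running total can be lost outright, whereas summing them first keeps them:

    >>> sum([1e16] + [1.0] * 10)           # large value first: the 1.0s vanish
    1e+16
    >>> sum(sorted([1e16] + [1.0] * 10))   # small values first
    1.000000000000001e+16
    >>> import math
    >>> math.fsum([1e16] + [1.0] * 10)     # correctly rounded true sum
    1.000000000000001e+16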
_sum doesn't duplicate the semantics of built-in sum(). It is sort of a hybrid of sum and math.fsum: like sum, it tries to conserve types, and give a sensible result when there are mixed types. Like fsum, it tries to be higher precision.
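Roughly (an illustration of the trade-off, not code from the module): built-in sum() keeps exact types but can lose precision with floats, while math.fsum() is precise but always returns a float.

    >>> from fractions import Fraction
    >>> import math
    >>> sum([Fraction(1, 4)] * 4)          # built-in sum conserves the type
    Fraction(1, 1)
    >>> math.fsum([Fraction(1, 4)] * 4)    # fsum always returns a float
    1.0
    >>> sum([0.1] * 10)                    # sum accumulates rounding error
    0.9999999999999999
    >>> math.fsum([0.1] * 10)              # fsum does not
    1.0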
On the other hand, I had missed the fact that this was an internal-only method. If changing statistics._sum so it reordered the iterable to preserve correctness wouldn't change the behavior of any supported external APIs, then obviously there's no need, and I'd prefer to leave it alone for 3.4.
Changes to _sum may be visible, because the external APIs such as mean and variance rely on it. For example, an extreme case: if I removed _sum and replaced it with math.fsum, then all of the external APIs would suddenly start outputting floats and nothing but floats. (I'm not intending to do that.)

I think that it is asking too much to promise that no statistics function will ever change its numeric result. I don't intend for them to become *less* accurate, but they might become *more* accurate. For example, currently the unit tests for variance pass with an acceptable tolerance of 1e-12 (relative error).

Perhaps this needs to be documented? The random module does something similar: http://docs.python.org/3/library/random.html#notes-on-reproducibility
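A relative-error check of that kind amounts to something like the following sketch; the helper name is illustrative, not the actual test code:

    def assert_rel_close(actual, expected, rel_tol=1e-12):
        # Accept the result if it differs from the expected value by no
        # more than rel_tol, measured relative to the expected value.
        assert abs(actual - expected) <= rel_tol * abs(expected)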
If you decided to change it for 3.5 and people were relying on its old behavior, that would be on them. (Though a comment saying "I might change this later" would be welcome... if true.)
If you're worried about people coming to rely on this, and thus running into trouble in the future if Counters get treated specially for (say) weighted data, then I'd accept a warning in the docs, or even a runtime warning. But not an exception.
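A runtime warning along those lines might look roughly like this (a hypothetical sketch; the function name and its placement are illustrative, not the module's actual code):

    import warnings

    def _warn_if_mapping(data):
        # Hypothetical guard: mappings such as dict and Counter happen to
        # work today, but that behaviour is not promised and may change.
        if isinstance(data, dict):
            warnings.warn(
                "passing a mapping (e.g. collections.Counter) to the "
                "statistics functions is not guaranteed and may change "
                "in a future version",
                FutureWarning,
                stacklevel=3,
            )
        return data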
The statistics module isn't marked as provisional. So the semantics that ship with 3.4 are going to be set in stone. Changing them later simply won't be an option--that will break code. If you want to treat Counter objects differently in the future than you do now, then I agree with Wolfgang: the best course of action would be to add an exception now. But again I'll defer to your judgment about what's best for your module.
Hmmm. Well, that's a much stronger promise of backward compatibility than I would have expected. The fact that (say) variance works with a dict is a pure accident of implementation, not advertised or promised in any way. But I'll accept your ruling. I want to reserve the right to add special handling of mappings in the future. In order of preference (highest to least) I'd like to:

1) Put a note in the documentation that handling of mappings is subject to change;

2) As above, plus issue a runtime warning via warnings.warn(); or

3) Raise an exception (this one only if you insist).

-- Steven